Project: Identify Customer Segments¶
In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.
It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
from scipy import stats
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from collections import Counter
from yellowbrick.cluster import KElbowVisualizer
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.set_option("future.no_silent_downcasting", True)
!pip install yellowbrick
Step 0: Load the Data¶
There are four files associated with this project (not including this one):
- `Udacity_AZDIAS_Subset.csv`: Demographics data for the general population of Germany; 891221 persons (rows) x 85 features (columns).
- `Udacity_CUSTOMERS_Subset.csv`: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- `Data_Dictionary.md`: Detailed information file about the features in the provided datasets.
- `AZDIAS_Feature_Summary.csv`: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.
Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.
# Load in the general demographics data.
pop = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
# Checking the structure of the data after it's loaded (i.e. print the number of
# rows and columns, print the first few rows).
print(f'The Population dataset has {pop.shape[0]} rows and {pop.shape[1]} columns')
The Population dataset has 891221 rows and 85 columns
print(f'The Features dataset has {feat_info.shape[0]} rows and {feat_info.shape[1]} columns')
The Features dataset has 85 rows and 4 columns
pop.head()
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GEBURTSJAHR | GFK_URLAUBERTYP | GREEN_AVANTGARDE | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | NATIONALITAET_KZ | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | SOHO_KZ | TITEL_KZ | VERS_TYP | ZABEOTYP | ALTER_HH | ANZ_PERSONEN | ANZ_TITEL | HH_EINKOMMEN_SCORE | KK_KUNDENTYP | W_KEIT_KIND_HH | WOHNDAUER_2008 | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE | CAMEO_DEUG_2015 | CAMEO_DEU_2015 | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_ANTG4 | KBA05_BAUMAX | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | KBA13_ANZAHL_PKW | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | 4 | 0 | 10.0 | 0 | -1 | 15.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0 | 0 | 5.0 | 2 | 6 | 7 | 5 | 1 | 5 | 3 | 3 | 4 | 7 | 6 | 6 | 5 | 3 | -1 | NaN | NaN | -1 | 3 | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | 1 | 1996 | 10.0 | 0 | 3 | 21.0 | 6.0 | 5.0 | 3.0 | 2.0 | 1.0 | 1 | 14 | 1.0 | 5 | 4 | 4 | 3 | 1 | 2 | 2 | 3 | 6 | 4 | 7 | 4 | 7 | 6 | 3 | 1.0 | 0.0 | 2 | 5 | 0.0 | 2.0 | 0.0 | 6.0 | NaN | 3.0 | 9.0 | 11.0 | 0.0 | 8.0 | 1.0 | 1992.0 | W | 4.0 | 8 | 8A | 51 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 | 1.0 | 6.0 | 3.0 | 8.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 | 963.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | 1 | 1979 | 10.0 | 1 | 3 | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1 | 15 | 3.0 | 4 | 1 | 3 | 3 | 4 | 4 | 6 | 3 | 4 | 7 | 7 | 7 | 3 | 3 | 2 | 0.0 | 0.0 | 1 | 5 | 17.0 | 1.0 | 0.0 | 4.0 | NaN | 3.0 | 9.0 | 10.0 | 0.0 | 1.0 | 5.0 | 1992.0 | W | 2.0 | 4 | 4C | 24 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 2.0 | 3.0 | 2.0 | 2.0 | 712.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | 6 | 1957 | 1.0 | 0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 4.0 | 1 | 8 | 2.0 | 5 | 1 | 2 | 1 | 4 | 4 | 7 | 4 | 3 | 4 | 4 | 5 | 4 | 4 | 1 | 0.0 | 0.0 | 1 | 3 | 13.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | 9.0 | 1.0 | 0.0 | 1.0 | 4.0 | 1997.0 | W | 7.0 | 2 | 2A | 12 | 4.0 | 1.0 | 0.0 | 0.0 | 1.0 | 4.0 | 4.0 | 2.0 | 6.0 | 4.0 | 0.0 | 4.0 | 1.0 | 0.0 | 596.0 | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | 5 | 1963 | 5.0 | 0 | 3 | 32.0 | 10.0 | 10.0 | 5.0 | 3.0 | 2.0 | 1 | 8 | 5.0 | 6 | 4 | 4 | 2 | 7 | 4 | 4 | 6 | 2 | 3 | 2 | 2 | 4 | 2 | 2 | 0.0 | 0.0 | 2 | 4 | 20.0 | 4.0 | 0.0 | 5.0 | 1.0 | 2.0 | 9.0 | 3.0 | 0.0 | 1.0 | 4.0 | 1992.0 | W | 3.0 | 6 | 6B | 43 | 1.0 | 4.0 | 1.0 | 0.0 | 0.0 | 3.0 | 2.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 5.0 | 435.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
feat_info.head()
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
def check_value_counts(df: pd.DataFrame = pop) -> None:
    """
    INPUT: df - the DataFrame whose columns we want to inspect
    OUTPUT: None; prints the value counts of every column, which lets us
            verify that the cleaning pipeline worked as intended
    """
    for col in df.columns:
        print(df[col].value_counts())
#check_value_counts(pop)
Tip: Add additional cells to keep everything in reasonably-sized chunks! Keyboard shortcuts:
`esc --> a` (press escape to enter command mode, then press the 'A' key) adds a new cell before the active cell, and `esc --> b` adds a new cell after the active cell. To convert the active cell to a markdown cell, use `esc --> m`; to convert it to a code cell, use `esc --> y`.
Step 1: Preprocessing¶
Step 1.1: Assess Missing Data¶
The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!
Step 1.1.1: Convert Missing Value Codes to NaNs¶
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
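Before looping over the full table, it can help to isolate the parsing step itself. A small sketch of one way to parse a single code string, assuming integer-looking tokens should become ints while the `'X'`/`'XX'` codes stay strings (the helper name is hypothetical):

```python
import re

def parse_missing_codes(code_str):
    """Parse a 'missing_or_unknown' string such as '[-1,X]' into a list.

    Integer-looking tokens become ints; non-numeric codes like 'X'/'XX'
    are kept as strings. '[]' yields an empty list.
    """
    # strip the surrounding brackets, then split on commas
    tokens = re.sub(r'[\[\]]', '', code_str).split(',')
    if tokens == ['']:
        return []
    return [int(t) if re.fullmatch(r'-?\d+', t) else t for t in tokens]

print(parse_missing_codes('[-1,X]'))  # [-1, 'X']
```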
As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.
feat_info['missing_or_unknown'].value_counts()
missing_or_unknown
[-1]        26
[-1,9]      17
[-1,0]      16
[0]         12
[]          10
[-1,0,9]     1
[-1,X]       1
[XX]         1
[-1,XX]      1
Name: count, dtype: int64
The `missing_or_unknown` column tells us which values in each Population (`pop`) column encode missing or unknown data; we should convert those values to `np.nan`.
print(f'The total number of missing or unknown values in our Populations data is {pop.isna().sum().sum()}')
The total number of missing or unknown values in our Populations data is 4896838
# Identify missing_or_unknown codes and convert them to NaN.
for x in range(len(feat_info)):
    # strip the surrounding brackets and split the code list on commas
    missing_data = re.sub(r'[\[\]]', '', feat_info.iloc[x]['missing_or_unknown']).split(',')
    if missing_data != ['']:
        # cast numeric codes to integers; keep the non-numeric 'X'/'XX' codes as strings
        missing_data = [np.int64(data) if data not in ('X', 'XX') else data for data in missing_data]
        pop = pop.replace({feat_info.iloc[x]['attribute']: missing_data}, np.nan)
print(f'After the conversion, the total number of missing or unknown values in our Population data is {pop.isna().sum().sum()}')
After the conversion, the total number of missing or unknown values in our Population data is 8373929
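The per-column loop above works, but pandas can also apply all the replacements in one call, which avoids rebuilding the frame on every iteration. A minimal sketch on a toy frame (the column names and codes here are illustrative, not from the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `pop`
demo = pd.DataFrame({'A': [1, -1, 0, 3],
                     'B': ['X', '2', '4', 'X']})

# One column -> missing-codes mapping, applied in a single replace() call
codes = {'A': [-1, 0], 'B': ['X']}
cleaned = demo.replace(codes, np.nan)
print(cleaned.isna().sum().sum())  # total codes converted to NaN
```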
#check_value_counts(pop)
Re-running `check_value_counts()` confirms the cleaning pipeline worked as intended.
Step 1.1.2: Assess Missing Data in Each Column¶
How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)
For the remaining features, are there any patterns in which columns have, or share, missing data?
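One way to look for such patterns is to group columns by their exact NaN counts: columns fed by the same underlying source tend to be missing for the same rows. A toy sketch (the demo columns are illustrative, not the real features):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'KBA05_A': [1, np.nan, 3, np.nan],
    'KBA05_B': [np.nan, 2, 3, np.nan],   # same missing count as KBA05_A
    'OTHER':   [1, 2, 3, 4],             # fully observed
})

# Group column names by their NaN counts to surface shared missingness
na_counts = demo.isna().sum()
groups = na_counts.groupby(na_counts).apply(lambda s: list(s.index))
print(groups.to_dict())
```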
missing = ((pop.isna().sum()/pop.shape[0])*100).sort_values(ascending=False)
missing
TITEL_KZ                 99.757636
AGER_TYP                 76.955435
KK_KUNDENTYP             65.596749
KBA05_BAUMAX             53.468668
GEBURTSJAHR              44.020282
ALTER_HH                 34.813699
KKK                      17.735668
REGIOTYP                 17.735668
W_KEIT_KIND_HH           16.605084
KBA05_ANTG1              14.959701
KBA05_ANTG2              14.959701
KBA05_ANTG3              14.959701
KBA05_ANTG4              14.959701
KBA05_GBZ                14.959701
MOBI_REGIO               14.959701
PLZ8_ANTG3               13.073637
PLZ8_ANTG2               13.073637
PLZ8_GBZ                 13.073637
PLZ8_HHZ                 13.073637
PLZ8_ANTG1               13.073637
PLZ8_BAUMAX              13.073637
PLZ8_ANTG4               13.073637
VERS_TYP                 12.476816
HEALTH_TYP               12.476816
SHOPPER_TYP              12.476816
NATIONALITAET_KZ         12.153551
PRAEGENDE_JUGENDJAHRE    12.136608
KBA13_ANZAHL_PKW         11.871354
ANZ_HAUSHALTE_AKTIV      11.176913
CAMEO_INTL_2015          11.147852
CAMEO_DEU_2015           11.147852
CAMEO_DEUG_2015          11.147852
LP_LEBENSPHASE_FEIN      10.954859
ARBEIT                   10.926022
RELAT_AB                 10.926022
ORTSGR_KLS9              10.914689
ANZ_HH_TITEL             10.884842
LP_LEBENSPHASE_GROB      10.611509
INNENSTADT               10.518154
EWDICHTE                 10.518154
BALLRAUM                 10.518154
GEBAEUDETYP_RASTER       10.452514
WOHNLAGE                 10.451729
OST_WEST_KZ              10.451729
MIN_GEBAEUDEJAHR         10.451729
GEBAEUDETYP              10.451729
LP_FAMILIE_FEIN           8.728699
LP_FAMILIE_GROB           8.728699
KONSUMNAEHE               8.299737
WOHNDAUER_2008            8.247000
SOHO_KZ                   8.247000
ANZ_TITEL                 8.247000
ANZ_PERSONEN              8.247000
HH_EINKOMMEN_SCORE        2.058749
GFK_URLAUBERTYP           0.544646
CJT_GESAMTTYP             0.544646
LP_STATUS_FEIN            0.544646
LP_STATUS_GROB            0.544646
RETOURTYP_BK_S            0.544646
ONLINE_AFFINITAET         0.544646
ALTERSKATEGORIE_GROB      0.323264
FINANZ_UNAUFFAELLIGER     0.000000
FINANZTYP                 0.000000
FINANZ_HAUSBAUER          0.000000
GREEN_AVANTGARDE          0.000000
FINANZ_SPARER             0.000000
FINANZ_MINIMALIST         0.000000
FINANZ_VORSORGER          0.000000
FINANZ_ANLEGER            0.000000
ANREDE_KZ                 0.000000
SEMIO_KAEM                0.000000
SEMIO_SOZ                 0.000000
SEMIO_PFLICHT             0.000000
SEMIO_FAM                 0.000000
SEMIO_REL                 0.000000
SEMIO_MAT                 0.000000
SEMIO_VERT                0.000000
SEMIO_LUST                0.000000
SEMIO_ERL                 0.000000
SEMIO_KULT                0.000000
SEMIO_RAT                 0.000000
SEMIO_KRIT                0.000000
SEMIO_DOM                 0.000000
SEMIO_TRADV               0.000000
ZABEOTYP                  0.000000
dtype: float64
# `missing` is already sorted in descending order, so this re-sort is a no-op;
# kept only to make the ordering explicit
missing_sorted = missing.sort_values(ascending=False)
missing_sorted.index
Index(['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR',
'ALTER_HH', 'KKK', 'REGIOTYP', 'W_KEIT_KIND_HH', 'MOBI_REGIO',
'KBA05_GBZ', 'KBA05_ANTG4', 'KBA05_ANTG2', 'KBA05_ANTG1', 'KBA05_ANTG3',
'PLZ8_ANTG3', 'PLZ8_ANTG2', 'PLZ8_GBZ', 'PLZ8_HHZ', 'PLZ8_ANTG1',
'PLZ8_BAUMAX', 'PLZ8_ANTG4', 'SHOPPER_TYP', 'VERS_TYP', 'HEALTH_TYP',
'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'KBA13_ANZAHL_PKW',
'ANZ_HAUSHALTE_AKTIV', 'CAMEO_INTL_2015', 'CAMEO_DEU_2015',
'CAMEO_DEUG_2015', 'LP_LEBENSPHASE_FEIN', 'RELAT_AB', 'ARBEIT',
'ORTSGR_KLS9', 'ANZ_HH_TITEL', 'LP_LEBENSPHASE_GROB', 'INNENSTADT',
'EWDICHTE', 'BALLRAUM', 'GEBAEUDETYP_RASTER', 'MIN_GEBAEUDEJAHR',
'GEBAEUDETYP', 'WOHNLAGE', 'OST_WEST_KZ', 'LP_FAMILIE_GROB',
'LP_FAMILIE_FEIN', 'KONSUMNAEHE', 'WOHNDAUER_2008', 'SOHO_KZ',
'ANZ_TITEL', 'ANZ_PERSONEN', 'HH_EINKOMMEN_SCORE', 'LP_STATUS_GROB',
'ONLINE_AFFINITAET', 'RETOURTYP_BK_S', 'LP_STATUS_FEIN',
'CJT_GESAMTTYP', 'GFK_URLAUBERTYP', 'ALTERSKATEGORIE_GROB', 'SEMIO_FAM',
'SEMIO_TRADV', 'SEMIO_DOM', 'SEMIO_KRIT', 'SEMIO_RAT', 'SEMIO_KULT',
'SEMIO_ERL', 'SEMIO_LUST', 'SEMIO_VERT', 'SEMIO_MAT', 'SEMIO_REL',
'FINANZ_SPARER', 'SEMIO_PFLICHT', 'SEMIO_SOZ', 'SEMIO_KAEM',
'ANREDE_KZ', 'FINANZ_ANLEGER', 'FINANZ_VORSORGER', 'FINANZ_MINIMALIST',
'GREEN_AVANTGARDE', 'FINANZ_HAUSBAUER', 'FINANZTYP',
'FINANZ_UNAUFFAELLIGER', 'ZABEOTYP'],
dtype='object')
missing_sorted.values
array([99.75763587, 76.9554353 , 65.59674873, 53.46866827, 44.02028229,
34.81369941, 17.73566826, 17.73566826, 16.60508449, 14.95970135,
14.95970135, 14.95970135, 14.95970135, 14.95970135, 14.95970135,
13.07363718, 13.07363718, 13.07363718, 13.07363718, 13.07363718,
13.07363718, 13.07363718, 12.47681551, 12.47681551, 12.47681551,
12.15355114, 12.13660809, 11.87135402, 11.17691347, 11.14785222,
11.14785222, 11.14785222, 10.95485856, 10.92602172, 10.92602172,
10.91468895, 10.88484226, 10.61150938, 10.51815431, 10.51815431,
10.51815431, 10.45251402, 10.45172858, 10.45172858, 10.45172858,
10.45172858, 8.72869917, 8.72869917, 8.2997371 , 8.24700046,
8.24700046, 8.24700046, 8.24700046, 2.05874862, 0.54464605,
0.54464605, 0.54464605, 0.54464605, 0.54464605, 0.54464605,
0.32326438, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ])
# Create a bar plot (one bar per column) instead of a histogram for better representation
plt.figure(figsize=(12, 9))
plt.bar(missing_sorted.index, missing_sorted.values)
plt.ylabel('% of values missing')
plt.xticks(rotation=75);
- The first 6 bars stand out; let us investigate further
# Perform an assessment of how much missing data there is in each column of the
# dataset.
plt.figure(figsize=(12, 9))
plt.hist(missing_sorted.values, bins=len(missing)+10, edgecolor='white')
plt.xlabel('% of values missing')
plt.ylabel('Number of columns');
For this analysis, we drop the 6 columns with more than 30% of their values missing; they make up the small bars on the far right of the histogram above¶
- Looking at the percentages, `TITEL_KZ` especially has 99.757636% of its values missing. Attempting to fill it would only add noise to the dataset, since we have almost no basis for choosing fill values
- Columns such as `AGER_TYP` and `KK_KUNDENTYP` might warrant more scrutiny before dropping, but they are 76.9% and 65.6% empty respectively
- Likewise, `KBA05_BAUMAX`, `GEBURTSJAHR` and `ALTER_HH` will be dropped
- The missing values in the remaining columns should duly be filled with whatever method we deem fit
pop = pop.drop(['TITEL_KZ','AGER_TYP','KK_KUNDENTYP','KBA05_BAUMAX','GEBURTSJAHR','ALTER_HH'], axis=1)
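The drop above hard-codes the six column names; a threshold-based selection is a reusable alternative (e.g. for the customers file later). A sketch on a toy frame, using the 30% cutoff from the discussion above (the demo data is illustrative):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    'mostly_present': [1.0, 2.0, np.nan, 4.0],         # 25% missing
})

# isna().mean() gives the per-column fraction of NaNs; scale to percent
pct_missing = demo.isna().mean() * 100
to_drop = pct_missing[pct_missing > 30].index.tolist()
print(to_drop)  # columns above the 30% threshold
```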
pop.describe()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | GREEN_AVANTGARDE | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | NATIONALITAET_KZ | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | SOHO_KZ | VERS_TYP | ZABEOTYP | ANZ_PERSONEN | ANZ_TITEL | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | WOHNDAUER_2008 | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | WOHNLAGE | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_ANTG4 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | KBA13_ANZAHL_PKW | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 888340.000000 | 891221.000000 | 886367.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 886367.000000 | 891221.000000 | 780025.000000 | 793589.000000 | 796649.000000 | 813429.000000 | 813429.000000 | 886367.000000 | 886367.000000 | 782906.000000 | 783057.000000 | 886367.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 891221.000000 | 780025.000000 | 817722.000000 | 780025.000000 | 891221.000000 | 817722.000000 | 817722.000000 | 872873.000000 | 743233.000000 | 817722.000000 | 791610.000000 | 794213.000000 | 798073.000000 | 817252.000000 | 798073.000000 | 798073.000000 | 757897.000000 | 757897.000000 | 757897.000000 | 757897.000000 | 757897.000000 | 797481.000000 | 797481.000000 | 797481.000000 | 798066.000000 | 733157.000000 | 757897.000000 | 886367.000000 | 733157.000000 | 785421.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 774706.000000 | 793846.000000 | 793947.000000 | 793846.000000 |
| mean | 2.757217 | 1.522098 | 3.632838 | 3.074528 | 2.821039 | 3.401106 | 3.033328 | 2.874167 | 3.075121 | 3.790586 | 7.350304 | 0.196612 | 2.190129 | 16.332161 | 4.955185 | 3.922339 | 2.381976 | 4.791151 | 2.432575 | 1.168889 | 9.280709 | 3.419630 | 3.945860 | 4.272729 | 4.240609 | 4.001597 | 4.023709 | 4.359086 | 4.481405 | 4.025014 | 3.910139 | 4.763223 | 4.667550 | 4.445007 | 4.256076 | 3.661784 | 1.590134 | 0.008423 | 1.511166 | 3.362438 | 1.727637 | 0.004162 | 4.207243 | 4.147141 | 7.908791 | 8.354924 | 0.040647 | 2.798641 | 3.018452 | 1993.277011 | 4.052836 | 1.494277 | 1.265584 | 0.624525 | 0.305927 | 3.158580 | 4.153043 | 3.939172 | 4.549491 | 3.738306 | 2.723384 | 2.963540 | 2.698691 | 4.472086 | 619.701439 | 2.253330 | 2.801858 | 1.595426 | 0.699166 | 1.943913 | 3.612821 | 3.381087 | 3.166686 | 5.293389 | 3.071033 |
| std | 1.009951 | 0.499512 | 1.595021 | 1.321055 | 1.464749 | 1.322134 | 1.529603 | 1.486731 | 1.353248 | 1.987876 | 3.525723 | 0.397437 | 0.755213 | 12.242378 | 3.748974 | 3.941285 | 1.701527 | 3.425305 | 1.474315 | 0.475075 | 4.032107 | 1.417741 | 1.946564 | 1.915885 | 2.007373 | 1.857540 | 2.077746 | 2.022829 | 1.807552 | 1.903816 | 1.580306 | 1.830789 | 1.795712 | 1.852412 | 1.770137 | 1.707637 | 1.027972 | 0.091392 | 0.499876 | 1.352704 | 1.155849 | 0.068855 | 1.624057 | 1.784211 | 1.923137 | 15.673731 | 0.324028 | 2.656713 | 1.550312 | 3.332739 | 1.949539 | 1.403961 | 1.245178 | 1.013443 | 0.638725 | 1.329537 | 2.183710 | 1.718996 | 2.028919 | 0.923193 | 0.979867 | 1.428882 | 1.521524 | 1.836357 | 340.034318 | 0.972008 | 0.920309 | 0.986736 | 0.727137 | 1.459654 | 0.973967 | 1.111598 | 0.999072 | 2.303379 | 1.360532 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1985.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 5.000000 | 0.000000 | 2.000000 | 6.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 3.000000 | 3.000000 | 8.000000 | 2.000000 | 0.000000 | 1.000000 | 2.000000 | 1992.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 384.000000 | 1.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 2.000000 |
| 50% | 3.000000 | 2.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 | 0.000000 | 2.000000 | 13.000000 | 3.000000 | 1.000000 | 1.000000 | 4.000000 | 2.000000 | 1.000000 | 9.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 3.000000 | 1.000000 | 0.000000 | 5.000000 | 4.000000 | 9.000000 | 4.000000 | 0.000000 | 1.000000 | 3.000000 | 1992.000000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 5.000000 | 4.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 5.000000 | 549.000000 | 2.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 4.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 |
| 75% | 4.000000 | 2.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | 6.000000 | 10.000000 | 0.000000 | 3.000000 | 28.000000 | 8.000000 | 8.000000 | 4.000000 | 9.000000 | 4.000000 | 1.000000 | 14.000000 | 5.000000 | 6.000000 | 6.000000 | 6.000000 | 5.000000 | 6.000000 | 6.000000 | 6.000000 | 5.000000 | 5.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 5.000000 | 2.000000 | 0.000000 | 2.000000 | 4.000000 | 2.000000 | 0.000000 | 6.000000 | 6.000000 | 9.000000 | 10.000000 | 0.000000 | 3.000000 | 4.000000 | 1993.000000 | 5.000000 | 3.000000 | 2.000000 | 1.000000 | 0.000000 | 4.000000 | 6.000000 | 6.000000 | 6.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 6.000000 | 778.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 7.000000 | 4.000000 |
| max | 4.000000 | 2.000000 | 6.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 6.000000 | 12.000000 | 1.000000 | 3.000000 | 40.000000 | 12.000000 | 11.000000 | 5.000000 | 10.000000 | 5.000000 | 3.000000 | 15.000000 | 5.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 3.000000 | 1.000000 | 2.000000 | 6.000000 | 45.000000 | 6.000000 | 6.000000 | 6.000000 | 9.000000 | 595.000000 | 23.000000 | 8.000000 | 7.000000 | 2016.000000 | 8.000000 | 4.000000 | 4.000000 | 3.000000 | 2.000000 | 5.000000 | 7.000000 | 6.000000 | 8.000000 | 5.000000 | 4.000000 | 6.000000 | 5.000000 | 7.000000 | 2300.000000 | 4.000000 | 4.000000 | 3.000000 | 2.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 9.000000 | 5.000000 |
Since outliers fall below the lower bound or above the upper bound, box plots are a good way to visualize these points
pop.columns
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP',
'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER',
'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER',
'FINANZTYP', 'GFK_URLAUBERTYP', 'GREEN_AVANTGARDE', 'HEALTH_TYP',
'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN',
'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB',
'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S',
'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT',
'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT',
'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV',
'SHOPPER_TYP', 'SOHO_KZ', 'VERS_TYP', 'ZABEOTYP', 'ANZ_PERSONEN',
'ANZ_TITEL', 'HH_EINKOMMEN_SCORE', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'GEBAEUDETYP', 'KONSUMNAEHE',
'MIN_GEBAEUDEJAHR', 'OST_WEST_KZ', 'WOHNLAGE', 'CAMEO_DEUG_2015',
'CAMEO_DEU_2015', 'CAMEO_INTL_2015', 'KBA05_ANTG1', 'KBA05_ANTG2',
'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_ANZAHL_PKW', 'PLZ8_ANTG1',
'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_BAUMAX', 'PLZ8_HHZ',
'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB'],
dtype='object')
- After running this loop the first time, I realised it broke because some columns don't contain numeric values
valueless_columns = []  # columns whose non-numeric dtype prevents a box plot
for col in pop.columns:
    try:
        pop[col].plot(kind='box')
        plt.title(f'Box plot for {col}')
        plt.show()
    except TypeError:
        # object-dtype columns cannot be box-plotted; collect them instead
        valueless_columns.append(col)
        continue
print(valueless_columns)
['OST_WEST_KZ', 'CAMEO_DEUG_2015', 'CAMEO_DEU_2015', 'CAMEO_INTL_2015']
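An alternative to catching plotting errors is to ask pandas for the non-numeric columns directly via `select_dtypes`. A minimal sketch on a toy frame (the demo values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    'num': [1, 2, 3],
    'OST_WEST_KZ': ['W', 'O', 'W'],   # object dtype, like the four columns found above
})

# select_dtypes returns only the columns matching the requested dtypes
object_cols = demo.select_dtypes(include='object').columns.tolist()
print(object_cols)
```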
for value in valueless_columns:
    # inspect the distinct values of each non-numeric column
    print(pop[value].value_counts().reset_index().iloc[:, 0].values)
    print(f'Values in {value} are above this text\n\n')
['W' 'O'] Values in OST_WEST_KZ are above this text ['8' '9' '6' '4' '3' '2' '7' '5' '1'] Values in CAMEO_DEUG_2015 are above this text ['6B' '8A' '4C' '2D' '3C' '7A' '3D' '8B' '4A' '8C' '9D' '9B' '9C' '7B' '9A' '2C' '8D' '6E' '2B' '5D' '6C' '2A' '5A' '1D' '1A' '3A' '5B' '5C' '7C' '4B' '4D' '3B' '6A' '9E' '6D' '6F' '7D' '4E' '1E' '7E' '1C' '5F' '1B' '5E'] Values in CAMEO_DEU_2015 are above this text ['51' '41' '24' '14' '43' '54' '25' '22' '23' '13' '45' '55' '52' '31' '34' '15' '44' '12' '35' '32' '33'] Values in CAMEO_INTL_2015 are above this text
- We missed these when visually confirming the output of `check_value_counts()`
pop[valueless_columns].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891221 entries, 0 to 891220 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OST_WEST_KZ 798073 non-null object 1 CAMEO_DEUG_2015 791869 non-null object 2 CAMEO_DEU_2015 791869 non-null object 3 CAMEO_INTL_2015 791869 non-null object dtypes: object(4) memory usage: 27.2+ MB
- From this, we can see that `CAMEO_INTL_2015` and `CAMEO_DEUG_2015` only need to be encoded properly.
pop['CAMEO_INTL_2015'] = pd.to_numeric(pop['CAMEO_INTL_2015'], downcast='integer', errors='coerce')
pop['CAMEO_DEUG_2015'] = pd.to_numeric(pop['CAMEO_DEUG_2015'], downcast='integer', errors='coerce')
pop.head().select_dtypes(include='number').columns
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP',
'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER',
'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER',
'FINANZTYP', 'GFK_URLAUBERTYP', 'GREEN_AVANTGARDE', 'HEALTH_TYP',
'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN',
'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB',
'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S',
'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT',
'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT',
'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV',
'SHOPPER_TYP', 'SOHO_KZ', 'VERS_TYP', 'ZABEOTYP', 'ANZ_PERSONEN',
'ANZ_TITEL', 'HH_EINKOMMEN_SCORE', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'GEBAEUDETYP', 'KONSUMNAEHE',
'MIN_GEBAEUDEJAHR', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015',
'KBA05_ANTG1', 'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_GBZ',
'BALLRAUM', 'EWDICHTE', 'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK',
'MOBI_REGIO', 'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_ANZAHL_PKW',
'PLZ8_ANTG1', 'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_BAUMAX',
'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB'],
dtype='object')
pop.head().select_dtypes(include='object').columns #verifying that the remaining object columns are as expected
Index(['OST_WEST_KZ', 'CAMEO_DEU_2015'], dtype='object')
len(pop.head().select_dtypes(include='number').columns)
77
Dropping Columns now¶
- On first thought, I assumed `CAMEO_DEU_2015` needed the letter behind each code extracted, but on investigating further, I realised we don't need to decode its values
pop = pop.drop(['CAMEO_DEU_2015'], axis=1)
- Since the Z-score measures how far a point deviates from the rest of the data, we could also use a high Z threshold to remove outliers, though not so aggressively as to alter our analysis
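For illustration, a minimal sketch of Z-score filtering on hypothetical data (we ultimately rely on IQR-based bounds in this notebook); note that a single extreme point inflates the standard deviation, which is why the threshold choice matters:

```python
import pandas as pd

# Hypothetical data: many ordinary points and one extreme outlier.
s = pd.Series([10] * 15 + [11] * 10 + [9] * 4 + [500])

z = (s - s.mean()) / s.std()  # how many SDs each point lies from the mean
threshold = 3                 # a high threshold removes only extreme points
filtered = s[z.abs() < threshold]
print(len(s), '->', len(filtered))  # only the extreme point is removed
```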
# Investigate patterns in the amount of missing data in each column.
pop.head().select_dtypes(include='number').columns
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP',
'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER',
'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER',
'FINANZTYP', 'GFK_URLAUBERTYP', 'GREEN_AVANTGARDE', 'HEALTH_TYP',
'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN',
'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB',
'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S',
'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT',
'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT',
'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV',
'SHOPPER_TYP', 'SOHO_KZ', 'VERS_TYP', 'ZABEOTYP', 'ANZ_PERSONEN',
'ANZ_TITEL', 'HH_EINKOMMEN_SCORE', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'GEBAEUDETYP', 'KONSUMNAEHE',
'MIN_GEBAEUDEJAHR', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015',
'KBA05_ANTG1', 'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_GBZ',
'BALLRAUM', 'EWDICHTE', 'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK',
'MOBI_REGIO', 'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_ANZAHL_PKW',
'PLZ8_ANTG1', 'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_BAUMAX',
'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB'],
dtype='object')
pop.select_dtypes(include='object').keys()
Index(['OST_WEST_KZ'], dtype='object')
pop.select_dtypes(include='number').keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP',
'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER',
'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER',
'FINANZTYP', 'GFK_URLAUBERTYP', 'GREEN_AVANTGARDE', 'HEALTH_TYP',
'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN',
'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB',
'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S',
'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT',
'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT',
'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV',
'SHOPPER_TYP', 'SOHO_KZ', 'VERS_TYP', 'ZABEOTYP', 'ANZ_PERSONEN',
'ANZ_TITEL', 'HH_EINKOMMEN_SCORE', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'GEBAEUDETYP', 'KONSUMNAEHE',
'MIN_GEBAEUDEJAHR', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015',
'KBA05_ANTG1', 'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_GBZ',
'BALLRAUM', 'EWDICHTE', 'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK',
'MOBI_REGIO', 'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_ANZAHL_PKW',
'PLZ8_ANTG1', 'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_BAUMAX',
'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB'],
dtype='object')
pop['OST_WEST_KZ'].value_counts()
OST_WEST_KZ W 629528 O 168545 Name: count, dtype: int64
pop['OST_WEST_KZ'] = pop['OST_WEST_KZ'].replace({'W':0, 'O':1}) #numeric encoding
pop['OST_WEST_KZ'] = pd.to_numeric(pop['OST_WEST_KZ'], errors='coerce') #converting to numeric; any unmapped values become NaNs to handle later
pop['OST_WEST_KZ'].value_counts()
OST_WEST_KZ 0.0 629528 1.0 168545 Name: count, dtype: int64
variances = pop.var()
plt.figure(figsize=(10, 6))
plt.hist(variances, bins=10, color='blue', alpha=0.7)
plt.title('Histogram of Feature Variances')
plt.xlabel('Variance')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.axvline(x=variances.mean(), color='red', linestyle='--', label='Mean Variance')
plt.legend()
plt.show()
variance_count = pop.var().sort_values(ascending=False) #computes the variance for a dataframe
variance_count
KBA13_ANZAHL_PKW 115623.337229 ANZ_HAUSHALTE_AKTIV 245.665852 CAMEO_INTL_2015 196.854351 LP_LEBENSPHASE_FEIN 149.875812 PRAEGENDE_JUGENDJAHRE 16.257885 LP_FAMILIE_FEIN 15.533731 LP_LEBENSPHASE_GROB 14.054807 GFK_URLAUBERTYP 12.430724 LP_STATUS_FEIN 11.732715 MIN_GEBAEUDEJAHR 11.107149 GEBAEUDETYP 7.058126 CAMEO_DEUG_2015 6.199253 ORTSGR_KLS9 5.305553 BALLRAUM 4.768589 SEMIO_VERT 4.317029 INNENSTADT 4.116513 SEMIO_LUST 4.091839 SEMIO_REL 4.029545 FINANZTYP 3.951651 WOHNLAGE 3.800701 SEMIO_SOZ 3.789112 WOHNDAUER_2008 3.698456 SEMIO_FAM 3.670615 SEMIO_KULT 3.624516 SEMIO_MAT 3.450455 SEMIO_KAEM 3.431431 REGIOTYP 3.372207 SEMIO_KRIT 3.351790 SEMIO_ERL 3.267244 SEMIO_DOM 3.224582 W_KEIT_KIND_HH 3.183411 SEMIO_PFLICHT 3.133384 EWDICHTE 2.954946 SEMIO_TRADV 2.916023 LP_FAMILIE_GROB 2.895195 HH_EINKOMMEN_SCORE 2.637561 CJT_GESAMTTYP 2.544092 SEMIO_RAT 2.497367 KONSUMNAEHE 2.403467 FINANZ_ANLEGER 2.339684 ONLINE_AFFINITAET 2.315034 FINANZ_UNAUFFAELLIGER 2.210369 LP_STATUS_GROB 2.173603 FINANZ_SPARER 2.145489 PLZ8_BAUMAX 2.130589 MOBI_REGIO 2.041704 RETOURTYP_BK_S 2.009990 KBA05_ANTG1 1.971106 RELAT_AB 1.851048 FINANZ_HAUSBAUER 1.831280 ZABEOTYP 1.829809 KBA05_GBZ 1.767670 FINANZ_VORSORGER 1.748039 FINANZ_MINIMALIST 1.745186 KBA05_ANTG2 1.550467 ANZ_PERSONEN 1.335988 PLZ8_GBZ 1.235649 SHOPPER_TYP 1.056726 KBA05_ANTG3 1.027067 ALTERSKATEGORIE_GROB 1.020000 ARBEIT 0.998145 PLZ8_ANTG3 0.973648 KKK 0.960140 PLZ8_HHZ 0.948611 PLZ8_ANTG1 0.944799 GEBAEUDETYP_RASTER 0.852285 PLZ8_ANTG2 0.846969 HEALTH_TYP 0.570347 PLZ8_ANTG4 0.528728 KBA05_ANTG4 0.407970 VERS_TYP 0.249876 ANREDE_KZ 0.249512 NATIONALITAET_KZ 0.225696 OST_WEST_KZ 0.166589 GREEN_AVANTGARDE 0.157956 ANZ_HH_TITEL 0.104994 SOHO_KZ 0.008352 ANZ_TITEL 0.004741 dtype: float64
variance_count.iloc[0]
115623.33722931633
variances = pop.var()
plt.figure(figsize=(10, 6))
plt.hist(variances, bins=10, color='blue', alpha=0.7)
plt.title('Histogram of Feature Variances')
plt.xlabel('Variance')
plt.ylabel('Frequency')
plt.grid(axis='both', alpha=0.75)
plt.axvline(x=variances.mean(), color='red', linestyle='--', label='Mean Variance')
plt.legend()
plt.show()
- With a variance of 115623.337229, `KBA13_ANZAHL_PKW` would dominate our entire dataset, so it is only logical to drop it. The `ANZ_HH_TITEL`, `SOHO_KZ`, and `ANZ_TITEL` columns have variances of just 0.104994, 0.008352 and 0.004741 respectively — so little variability that we can drop these too
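Screening out near-constant features can be sketched generically with pandas (hypothetical frame and cutoff, not our actual data):

```python
import pandas as pd

# Hypothetical frame: one near-constant column and two informative ones.
df = pd.DataFrame({
    'near_constant': [0, 0, 0, 0, 1],
    'informative_a': [1, 2, 3, 4, 5],
    'informative_b': [5, 3, 1, 4, 2],
})

# Columns whose variance falls under a chosen cutoff carry little signal.
low_var = df.var()[df.var() < 0.5].index.tolist()
df = df.drop(columns=low_var)
print(low_var, df.columns.tolist())
```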
variances_to_drop = ['KBA13_ANZAHL_PKW']
pop = pop.drop(axis=1, columns=variances_to_drop)
def identify_outlier_columns(df: pd.DataFrame, threshold: int = 4) -> list:
    """
    INPUT:
    df: pandas dataframe to check for outlier columns
    threshold: IQR multiplier; values outside Q1 - threshold*IQR or
               Q3 + threshold*IQR are treated as outliers
    OUTPUT:
    outlier_columns: a list of columns that contain outliers
    """
    outlier_columns = []
    for column in df.select_dtypes(include=[np.number]).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (threshold * IQR)
        upper_bound = Q3 + (threshold * IQR)
        if df[(df[column] < lower_bound) | (df[column] > upper_bound)].shape[0] > 0:
            outlier_columns.append(column)
    return outlier_columns
pop_outliers = identify_outlier_columns(pop)
pop_outliers
['GREEN_AVANTGARDE', 'NATIONALITAET_KZ', 'SOHO_KZ', 'ANZ_PERSONEN', 'ANZ_TITEL', 'WOHNDAUER_2008', 'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL', 'MIN_GEBAEUDEJAHR', 'OST_WEST_KZ', 'KBA05_ANTG4']
pop = pop.drop(columns=pop_outliers, axis=1)
pop.shape[1]
66
# Remove the outlier columns from the dataset. (You'll perform other data
# engineering tasks such as re-encoding and imputation later.)
Discussion 1.1.2: Assess Missing / Outlier Data in Each Column¶
Although these were already discussed in earlier markdown cells, here is the summary:
- `TITEL_KZ` especially has 99.757636% of its values missing.
- Columns `AGER_TYP` and `KK_KUNDENTYP` came under scrutiny, as 76.9% and 65.6% of their values are empty, respectively.
- The `OST_WEST_KZ`, `CAMEO_DEUG_2015`, `CAMEO_DEU_2015`, and `CAMEO_INTL_2015` columns specifically were filled with non-numeric data of different types; some we couldn't work with, the others we could wrangle.
- In the end, `OST_WEST_KZ`, `TITEL_KZ`, and `CAMEO_DEU_2015` had to be dropped after the above considerations.
- `KBA13_ANZAHL_PKW`, `ANZ_HH_TITEL`, `SOHO_KZ`, and `ANZ_TITEL` have also been dropped because of their variance values.
- `GREEN_AVANTGARDE`, `NATIONALITAET_KZ`, `ANZ_PERSONEN`, `WOHNDAUER_2008`, `ANZ_HAUSHALTE_AKTIV`, `MIN_GEBAEUDEJAHR`, and `KBA05_ANTG4` exceeded our outlier threshold and were dropped as well.
Step 1.1.3: Assess Missing Data in Each Row¶
Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have a very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.
In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.
- You can use seaborn's `countplot()` function to create a bar chart of code frequencies and matplotlib's `subplot()` function to put bar charts for the two subplots side by side.
- To reduce repeated code, you might want to write a function that can perform this comparison, taking as one of its arguments a column to be compared.
Depending on what you observe in your comparison, this will have implications on how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note on those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.
# How much data is missing in each row of the dataset?
missing_rows = ((pop.isna().sum(axis=1)/pop.shape[1])*100) #percentage of each row's values that are missing
plt.hist(missing_rows);
plt.xlabel('Percentage of missing values')
plt.ylabel('Number of Rows')
plt.title('Row Nan Plot');
- From the plot, a 30% threshold looks reasonable for the split
pop.iloc[0].reset_index().T #examining our first row
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | VERS_TYP | ZABEOTYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | GEBAEUDETYP | KONSUMNAEHE | WOHNLAGE | CAMEO_DEUG_2015 | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB |
| 0 | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 5.0 | 5.0 | 3.0 | 4.0 | 10.0 | NaN | 15.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | NaN | 5.0 | 2.0 | 6.0 | 7.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 4.0 | 7.0 | 6.0 | 6.0 | 5.0 | 3.0 | NaN | NaN | 3.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
len(missing_rows)
891221
excess_nans = pop[missing_rows >= 30] #rows with 30% or more of their values missing
tolerable_nans = pop[missing_rows < 30] #rows with less than 30% of their values missing
print(len(excess_nans))
print(len(tolerable_nans))
93557 797664
row_thresh = (len(excess_nans) / (len(excess_nans)+len(tolerable_nans)))*100
print(f'{round(row_thresh)}% of our rows exceed our 30% missing-value threshold')
10% of our rows exceed our 30% missing-value threshold
pop.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | VERS_TYP | ZABEOTYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | GEBAEUDETYP | KONSUMNAEHE | WOHNLAGE | CAMEO_DEUG_2015 | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | 4 | 10.0 | NaN | 15.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | NaN | 5.0 | 2 | 6 | 7 | 5 | 1 | 5 | 3 | 3 | 4 | 7 | 6 | 6 | 5 | 3 | NaN | NaN | 3 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1.0 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | 1 | 10.0 | 3.0 | 21.0 | 6.0 | 5.0 | 3.0 | 2.0 | 1.0 | 14.0 | 1.0 | 5 | 4 | 4 | 3 | 1 | 2 | 2 | 3 | 6 | 4 | 7 | 4 | 7 | 6 | 3.0 | 2.0 | 5 | 6.0 | 3.0 | 8.0 | 1.0 | 4.0 | 8.0 | 51.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 8.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | 3.0 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | 1 | 10.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 | 15.0 | 3.0 | 4 | 1 | 3 | 3 | 4 | 4 | 6 | 3 | 4 | 7 | 7 | 7 | 3 | 3 | 2.0 | 1.0 | 5 | 4.0 | 3.0 | 1.0 | 5.0 | 2.0 | 4.0 | 24.0 | 1.0 | 3.0 | 1.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 2.0 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 4.0 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | 6 | 1.0 | 2.0 | NaN | NaN | NaN | NaN | 9.0 | 4.0 | 8.0 | 2.0 | 5 | 1 | 2 | 1 | 4 | 4 | 7 | 4 | 3 | 4 | 4 | 5 | 4 | 4 | 1.0 | 1.0 | 3 | 1.0 | NaN | 1.0 | 4.0 | 7.0 | 2.0 | 12.0 | 4.0 | 1.0 | 0.0 | 4.0 | 4.0 | 2.0 | 6.0 | 4.0 | NaN | 4.0 | 1.0 | NaN | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | 3.0 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | 5 | 5.0 | 3.0 | 32.0 | 10.0 | 10.0 | 5.0 | 3.0 | 2.0 | 8.0 | 5.0 | 6 | 4 | 4 | 2 | 7 | 4 | 4 | 6 | 2 | 3 | 2 | 2 | 4 | 2 | 2.0 | 2.0 | 4 | 5.0 | 2.0 | 1.0 | 4.0 | 3.0 | 6.0 | 43.0 | 1.0 | 4.0 | 1.0 | 3.0 | 2.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 5.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
pop.shape
(891221, 66)
columns_with_few_nans = pop.columns[pop.isna().mean() < 0.2].tolist()[:11]
columns_with_few_nans
['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP', 'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'FINANZTYP', 'GFK_URLAUBERTYP']
def plot_distribution(col: str) -> None:
    """
    INPUT:
    col (str): column in pandas dataframe
    OUTPUT:
    None: displays side-by-side count plots for the two subsets
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True)
    sns.countplot(x=col, data=tolerable_nans, ax=axes[0])
    axes[0].set_title(f'Low Amount of NaNs - {col}')
    sns.countplot(x=col, data=excess_nans, ax=axes[1])
    axes[1].set_title(f'High Amount of NaNs - {col}')
    plt.show()
for col in columns_with_few_nans:
    plot_distribution(col)
- For these columns with few missing values, the distributions are distinctly different between the tolerable-NaN and excess-NaN subsets.
pop.keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'CJT_GESAMTTYP',
'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER',
'FINANZ_ANLEGER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER',
'FINANZTYP', 'GFK_URLAUBERTYP', 'HEALTH_TYP', 'LP_LEBENSPHASE_FEIN',
'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB',
'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'PRAEGENDE_JUGENDJAHRE',
'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT',
'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT',
'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV',
'SHOPPER_TYP', 'VERS_TYP', 'ZABEOTYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'GEBAEUDETYP', 'KONSUMNAEHE', 'WOHNLAGE',
'CAMEO_DEUG_2015', 'CAMEO_INTL_2015', 'KBA05_ANTG1', 'KBA05_ANTG2',
'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE', 'INNENSTADT',
'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO', 'ONLINE_AFFINITAET',
'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4',
'PLZ8_BAUMAX', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9',
'RELAT_AB'],
dtype='object')
Discussion 1.1.3: Assess Missing Data in Each Row¶
- The dimension `ANREDE_KZ`, and to a lesser extent `ALTERSKATEGORIE_GROB`, show similar distributions between their tolerable-NaN and excess-NaN subsets.
- In most cases, though, the data with a high amount of missing values is distributed very differently from the data with few or no missing values. This means the rows with many missing values are qualitatively different; we continue the analysis with the low-missing subset but flag the dropped rows as special, since ignoring them entirely could bias our later conclusions.
Step 1.2: Select and Re-Encode Features¶
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
- For numeric and interval data, these features can be kept without changes.
- Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
- Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
# How many features are there of each data type?
feat_info['type'].value_counts()
type ordinal 49 categorical 21 numeric 7 mixed 7 interval 1 Name: count, dtype: int64
- Leaving numeric, ordinal, and interval data as they are, we are left to wrangle the categorical and mixed data types
feat_info.query('type=="categorical" or type=="mixed"') #finding the categorical columns
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 10 | FINANZTYP | person | categorical | [-1] |
| 12 | GFK_URLAUBERTYP | person | categorical | [] |
| 13 | GREEN_AVANTGARDE | person | categorical | [] |
| 15 | LP_LEBENSPHASE_FEIN | person | mixed | [0] |
| 16 | LP_LEBENSPHASE_GROB | person | mixed | [0] |
| 17 | LP_FAMILIE_FEIN | person | categorical | [0] |
| 18 | LP_FAMILIE_GROB | person | categorical | [0] |
| 19 | LP_STATUS_FEIN | person | categorical | [0] |
| 20 | LP_STATUS_GROB | person | categorical | [0] |
| 21 | NATIONALITAET_KZ | person | categorical | [-1,0] |
| 22 | PRAEGENDE_JUGENDJAHRE | person | mixed | [-1,0] |
| 38 | SHOPPER_TYP | person | categorical | [-1] |
| 39 | SOHO_KZ | person | categorical | [-1] |
| 40 | TITEL_KZ | person | categorical | [-1,0] |
| 41 | VERS_TYP | person | categorical | [-1] |
| 42 | ZABEOTYP | person | categorical | [-1,9] |
| 47 | KK_KUNDENTYP | household | categorical | [-1] |
| 52 | GEBAEUDETYP | building | categorical | [-1,0] |
| 55 | OST_WEST_KZ | building | categorical | [-1] |
| 56 | WOHNLAGE | building | mixed | [-1] |
| 57 | CAMEO_DEUG_2015 | microcell_rr4 | categorical | [-1,X] |
| 58 | CAMEO_DEU_2015 | microcell_rr4 | categorical | [XX] |
| 59 | CAMEO_INTL_2015 | microcell_rr4 | mixed | [-1,XX] |
| 64 | KBA05_BAUMAX | microcell_rr3 | mixed | [-1,0] |
| 79 | PLZ8_BAUMAX | macrocell_plz8 | mixed | [-1,0] |
cat_dimensions = feat_info.query('type=="categorical" or type=="mixed"')['attribute'] #finding these categorical columns
cat_dimensions
0 AGER_TYP 2 ANREDE_KZ 3 CJT_GESAMTTYP 10 FINANZTYP 12 GFK_URLAUBERTYP 13 GREEN_AVANTGARDE 15 LP_LEBENSPHASE_FEIN 16 LP_LEBENSPHASE_GROB 17 LP_FAMILIE_FEIN 18 LP_FAMILIE_GROB 19 LP_STATUS_FEIN 20 LP_STATUS_GROB 21 NATIONALITAET_KZ 22 PRAEGENDE_JUGENDJAHRE 38 SHOPPER_TYP 39 SOHO_KZ 40 TITEL_KZ 41 VERS_TYP 42 ZABEOTYP 47 KK_KUNDENTYP 52 GEBAEUDETYP 55 OST_WEST_KZ 56 WOHNLAGE 57 CAMEO_DEUG_2015 58 CAMEO_DEU_2015 59 CAMEO_INTL_2015 64 KBA05_BAUMAX 79 PLZ8_BAUMAX Name: attribute, dtype: object
pop.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | VERS_TYP | ZABEOTYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | GEBAEUDETYP | KONSUMNAEHE | WOHNLAGE | CAMEO_DEUG_2015 | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | 4 | 10.0 | NaN | 15.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | NaN | 5.0 | 2 | 6 | 7 | 5 | 1 | 5 | 3 | 3 | 4 | 7 | 6 | 6 | 5 | 3 | NaN | NaN | 3 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1.0 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | 1 | 10.0 | 3.0 | 21.0 | 6.0 | 5.0 | 3.0 | 2.0 | 1.0 | 14.0 | 1.0 | 5 | 4 | 4 | 3 | 1 | 2 | 2 | 3 | 6 | 4 | 7 | 4 | 7 | 6 | 3.0 | 2.0 | 5 | 6.0 | 3.0 | 8.0 | 1.0 | 4.0 | 8.0 | 51.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 8.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | 3.0 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | 1 | 10.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 | 15.0 | 3.0 | 4 | 1 | 3 | 3 | 4 | 4 | 6 | 3 | 4 | 7 | 7 | 7 | 3 | 3 | 2.0 | 1.0 | 5 | 4.0 | 3.0 | 1.0 | 5.0 | 2.0 | 4.0 | 24.0 | 1.0 | 3.0 | 1.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 2.0 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 4.0 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | 6 | 1.0 | 2.0 | NaN | NaN | NaN | NaN | 9.0 | 4.0 | 8.0 | 2.0 | 5 | 1 | 2 | 1 | 4 | 4 | 7 | 4 | 3 | 4 | 4 | 5 | 4 | 4 | 1.0 | 1.0 | 3 | 1.0 | NaN | 1.0 | 4.0 | 7.0 | 2.0 | 12.0 | 4.0 | 1.0 | 0.0 | 4.0 | 4.0 | 2.0 | 6.0 | 4.0 | NaN | 4.0 | 1.0 | NaN | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | 3.0 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | 5 | 5.0 | 3.0 | 32.0 | 10.0 | 10.0 | 5.0 | 3.0 | 2.0 | 8.0 | 5.0 | 6 | 4 | 4 | 2 | 7 | 4 | 4 | 6 | 2 | 3 | 2 | 2 | 4 | 2 | 2.0 | 2.0 | 4 | 5.0 | 2.0 | 1.0 | 4.0 | 3.0 | 6.0 | 43.0 | 1.0 | 4.0 | 1.0 | 3.0 | 2.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 5.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
#pop['AGER_TYP'].value_counts().reset_index().iloc[:,0].values
for col in cat_dimensions:
    try:
        counte = pop[col].value_counts().reset_index().iloc[:, 0].values
        print(f'{col} has: {counte}')
    except KeyError:
        continue  # column was dropped earlier
ANREDE_KZ has: [2 1] CJT_GESAMTTYP has: [4. 3. 6. 2. 5. 1.] FINANZTYP has: [6 1 4 2 5 3] GFK_URLAUBERTYP has: [12. 5. 10. 8. 11. 4. 9. 3. 1. 2. 7. 6.] LP_LEBENSPHASE_FEIN has: [ 1. 5. 6. 2. 8. 11. 29. 7. 13. 10. 31. 12. 30. 15. 3. 19. 37. 4. 14. 20. 32. 39. 40. 27. 16. 38. 35. 34. 9. 21. 28. 24. 36. 25. 23. 22. 18. 33. 17. 26.] LP_LEBENSPHASE_GROB has: [ 2. 1. 3. 12. 4. 5. 9. 10. 11. 8. 6. 7.] LP_FAMILIE_FEIN has: [ 1. 10. 2. 11. 8. 7. 4. 5. 9. 6. 3.] LP_FAMILIE_GROB has: [1. 5. 2. 4. 3.] LP_STATUS_FEIN has: [ 1. 9. 2. 10. 4. 5. 3. 6. 8. 7.] LP_STATUS_GROB has: [1. 2. 4. 5. 3.] PRAEGENDE_JUGENDJAHRE has: [14. 8. 5. 10. 3. 15. 11. 9. 6. 12. 1. 4. 2. 13. 7.] SHOPPER_TYP has: [1. 2. 3. 0.] VERS_TYP has: [2. 1.] ZABEOTYP has: [3 4 1 5 6 2] GEBAEUDETYP has: [1. 3. 8. 2. 4. 6. 5.] WOHNLAGE has: [3. 7. 4. 2. 5. 1. 8. 0.] CAMEO_DEUG_2015 has: [8. 9. 6. 4. 3. 2. 7. 5. 1.] CAMEO_INTL_2015 has: [51. 41. 24. 14. 43. 54. 25. 22. 23. 13. 45. 55. 52. 31. 34. 15. 44. 12. 35. 32. 33.] PLZ8_BAUMAX has: [1. 5. 2. 4. 3.]
multi_col = []
for col in cat_dimensions:
    column_data = pop.get(col)
    if column_data is not None:
        counte = column_data.value_counts().reset_index().iloc[:, 0].values
        print(f'{col} has: {counte} values')
        if len(counte) > 2:
            multi_col.append(col)
ANREDE_KZ has: [2 1] values CJT_GESAMTTYP has: [4. 3. 6. 2. 5. 1.] values FINANZTYP has: [6 1 4 2 5 3] values GFK_URLAUBERTYP has: [12. 5. 10. 8. 11. 4. 9. 3. 1. 2. 7. 6.] values LP_LEBENSPHASE_FEIN has: [ 1. 5. 6. 2. 8. 11. 29. 7. 13. 10. 31. 12. 30. 15. 3. 19. 37. 4. 14. 20. 32. 39. 40. 27. 16. 38. 35. 34. 9. 21. 28. 24. 36. 25. 23. 22. 18. 33. 17. 26.] values LP_LEBENSPHASE_GROB has: [ 2. 1. 3. 12. 4. 5. 9. 10. 11. 8. 6. 7.] values LP_FAMILIE_FEIN has: [ 1. 10. 2. 11. 8. 7. 4. 5. 9. 6. 3.] values LP_FAMILIE_GROB has: [1. 5. 2. 4. 3.] values LP_STATUS_FEIN has: [ 1. 9. 2. 10. 4. 5. 3. 6. 8. 7.] values LP_STATUS_GROB has: [1. 2. 4. 5. 3.] values PRAEGENDE_JUGENDJAHRE has: [14. 8. 5. 10. 3. 15. 11. 9. 6. 12. 1. 4. 2. 13. 7.] values SHOPPER_TYP has: [1. 2. 3. 0.] values VERS_TYP has: [2. 1.] values ZABEOTYP has: [3 4 1 5 6 2] values GEBAEUDETYP has: [1. 3. 8. 2. 4. 6. 5.] values WOHNLAGE has: [3. 7. 4. 2. 5. 1. 8. 0.] values CAMEO_DEUG_2015 has: [8. 9. 6. 4. 3. 2. 7. 5. 1.] values CAMEO_INTL_2015 has: [51. 41. 24. 14. 43. 54. 25. 22. 23. 13. 45. 55. 52. 31. 34. 15. 44. 12. 35. 32. 33.] values PLZ8_BAUMAX has: [1. 5. 2. 4. 3.] values
print(multi_col)
['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'PRAEGENDE_JUGENDJAHRE', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015', 'PLZ8_BAUMAX']
Step 1.2.1: Re-Encode Categorical Features¶
For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:
- For binary (two-level) categoricals that take numeric values, you can keep them without needing to do anything.
- There is one binary variable that takes on non-numeric values. For this one, you need to re-encode the values as numbers or create a dummy variable. (We already did this earlier for `OST_WEST_KZ`.)
- For multi-level categoricals (three or more values), you can choose to encode the values using multiple dummy variables (e.g. via OneHotEncoder), or (to keep things straightforward) just drop them from the analysis. As always, document your choices in the Discussion section.
- Basically, the columns collected in the multi_col list are the ones that are still categorical and have more than 2 unique values
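For reference, dummy-encoding a multi-level categorical is a one-liner with pandas; a minimal, self-contained sketch with illustrative values (this project drops these columns instead of encoding them):

```python
import pandas as pd

toy = pd.DataFrame({'FINANZTYP': [1, 3, 2, 3, 1]})  # toy multi-level categorical

# One binary column per level; the prefix records the source column
dummies = pd.get_dummies(toy['FINANZTYP'], prefix='FINANZTYP')
print(dummies.columns.tolist())  # ['FINANZTYP_1', 'FINANZTYP_2', 'FINANZTYP_3']
```

Each original level becomes its own 0/1 column, which is the trade-off against dropping: more information retained, but a wider feature space.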
Assess categorical variables: which are binary, which are multi-level, and which one needs to be re-encoded?
valid_columns = [col for col in multi_col if col in pop.columns] # Subset the DataFrame to only columns present in the multi_col list
#both needed for later encoding
valid_columns.remove('CAMEO_INTL_2015')
valid_columns.remove('PRAEGENDE_JUGENDJAHRE')
print(valid_columns)
['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'PLZ8_BAUMAX']
pop = pop.drop(columns=valid_columns)
# Re-encode categorical variable(s) to be kept in the analysis.
Discussion 1.2.1: Re-Encode Categorical Features¶
I kept the numerically-encoded binary categoricals (e.g. ANREDE_KZ and VERS_TYP) as-is; the single non-numeric binary, OST_WEST_KZ, had already been re-encoded to 0/1 earlier. Of the multi-level categoricals collected in multi_col, I dropped all of them except CAMEO_INTL_2015 and PRAEGENDE_JUGENDJAHRE, which are retained so that new features can be engineered from them in Step 1.2.2. Dropping rather than one-hot encoding keeps the feature space compact, at the cost of discarding some category information.
feat_info.query('type=="mixed"')['attribute'].to_list()
['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'PRAEGENDE_JUGENDJAHRE', 'WOHNLAGE', 'CAMEO_INTL_2015', 'KBA05_BAUMAX', 'PLZ8_BAUMAX']
Step 1.2.2: Engineer Mixed-Type Features¶
There are a handful of features that are marked as "mixed" in the feature summary and require special treatment in order to be included in the analysis. Two in particular deserve attention; the handling of the rest is up to your own choices:
- "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, you should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
- "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
- If you decide to keep or engineer new features around the other mixed-type features, make sure you note your steps in the Discussion section.
Reading Data_Dictionary.md clarifies the encodings behind both features; they are decoded below.
pop['CAMEO_INTL_2015'].value_counts().reset_index().sort_values(by='CAMEO_INTL_2015')
| | CAMEO_INTL_2015 | count |
|---|---|---|
| 17 | 12.0 | 13249 |
| 9 | 13.0 | 26336 |
| 3 | 14.0 | 62884 |
| 15 | 15.0 | 16974 |
| 7 | 22.0 | 33155 |
| 8 | 23.0 | 26750 |
| 2 | 24.0 | 91158 |
| 6 | 25.0 | 39628 |
| 13 | 31.0 | 19024 |
| 19 | 32.0 | 10354 |
| 20 | 33.0 | 9935 |
| 14 | 34.0 | 18524 |
| 18 | 35.0 | 10356 |
| 1 | 41.0 | 92336 |
| 4 | 43.0 | 56672 |
| 16 | 44.0 | 14820 |
| 10 | 45.0 | 26132 |
| 0 | 51.0 | 133694 |
| 12 | 52.0 | 20542 |
| 5 | 54.0 | 45391 |
| 11 | 55.0 | 23955 |
CAMEO_INTL_2015 DECODING¶
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families & Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown
wealth = []
life_stage = []
for i in pop['CAMEO_INTL_2015'].astype(float):
    if pd.isna(i):
        wealth.append(np.nan)      # keep missing values missing for later imputation
        life_stage.append(np.nan)
        continue
    if i//10 == 1.0:
        wealth.append(0)   # wealthy households
    elif i//10 == 2.0:
        wealth.append(1)   # prosperous households
    elif i//10 == 3.0:
        wealth.append(2)   # comfortable households
    elif i//10 == 4.0:
        wealth.append(3)   # less affluent households
    else:
        wealth.append(4)   # poorer households
    if i%10 == 1.0:
        life_stage.append(0)   # pre-family couples & singles
    elif i%10 == 2.0:
        life_stage.append(1)   # young couples with children
    elif i%10 == 3.0:
        life_stage.append(2)   # families with school-age children
    elif i%10 == 4.0:
        life_stage.append(3)   # older families & mature couples
    else:
        life_stage.append(4)   # elders in retirement
pop['wealth'] = wealth          # wealth from the tens digit (floor division)
pop['life_stage'] = life_stage  # life stage from the ones digit (remainder)
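The same two-digit decomposition can also be done vectorized with floor division and modulo; a minimal sketch on a toy Series (the codes are illustrative, and NaN simply propagates through both operations instead of landing in a bucket):

```python
import pandas as pd

codes = pd.Series([51.0, 24.0, 43.0, 12.0])  # sample CAMEO_INTL_2015 codes

wealth = codes // 10      # tens place: 1 (wealthy) ... 5 (poorer)
life_stage = codes % 10   # ones place: 1 (pre-family) ... 5 (retirement)

print(wealth.tolist())      # [5.0, 2.0, 4.0, 1.0]
print(life_stage.tolist())  # [1.0, 4.0, 3.0, 2.0]
```

These are the raw digit values (1-5) rather than the 0-4 re-coding used above; for an ordinal feature either scale works, as long as it is applied consistently.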
pop.keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2',
'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT',
'ORTSGR_KLS9', 'RELAT_AB', 'wealth', 'life_stage'],
dtype='object')
pop.columns.to_list()[:5]
['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST', 'FINANZ_SPARER', 'FINANZ_VORSORGER']
PRAEGENDE_JUGENDJAHRE DECODING¶
Dominating movement of person's youth (avantgarde vs. mainstream; east vs. west)
- -1: unknown
- 0: unknown
- 1: 40s - war years (Mainstream, E+W)
- 2: 40s - reconstruction years (Avantgarde, E+W)
- 3: 50s - economic miracle (Mainstream, E+W)
- 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
- 5: 60s - economic miracle (Mainstream, E+W)
- 6: 60s - generation 68 / student protestors (Avantgarde, W)
- 7: 60s - opponents to the building of the Wall (Avantgarde, E)
- 8: 70s - family orientation (Mainstream, E+W)
- 9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)
pop['PRAEGENDE_JUGENDJAHRE'].value_counts().reset_index().sort_values(by='PRAEGENDE_JUGENDJAHRE')
| | PRAEGENDE_JUGENDJAHRE | count |
|---|---|---|
| 10 | 1.0 | 21282 |
| 12 | 2.0 | 7479 |
| 4 | 3.0 | 55195 |
| 11 | 4.0 | 20451 |
| 2 | 5.0 | 86416 |
| 8 | 6.0 | 25652 |
| 14 | 7.0 | 4010 |
| 1 | 8.0 | 145988 |
| 7 | 9.0 | 33570 |
| 3 | 10.0 | 85808 |
| 6 | 11.0 | 35752 |
| 9 | 12.0 | 24446 |
| 13 | 13.0 | 5764 |
| 0 | 14.0 | 188697 |
| 5 | 15.0 | 42547 |
#generation by decade, movement (mainstream vs. avantgarde)
generation = []
movement = []
for i in pop['PRAEGENDE_JUGENDJAHRE'].astype(float):
    if pd.isna(i):
        generation.append(np.nan)  # keep missing values missing for later imputation
        movement.append(np.nan)
        continue
    if i in (1, 2):
        generation.append(0)   # 40s
    elif i in (3, 4):
        generation.append(1)   # 50s
    elif i in (5, 6, 7):
        generation.append(2)   # 60s
    elif i in (8, 9):
        generation.append(3)   # 70s
    elif i in (10, 11, 12, 13):
        generation.append(4)   # 80s
    else:
        generation.append(5)   # 90s (levels 14 and 15)
    if i in (1, 3, 5, 8, 10, 12, 14):
        movement.append(0)     # mainstream movement
    else:
        movement.append(1)     # avantgarde movement
pop['generation'] = generation  # encoding generation (ordinal by decade)
pop['movement'] = movement      # encoding movement type (binary)
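An equivalent dictionary-based mapping is more compact and keeps the decade assignments explicit; a sketch under the assumption that the raw levels are decoded per the data dictionary above (0 = mainstream, 1 = avantgarde; NaNs would need to be masked before the lambda mapping):

```python
import pandas as pd

# Decade and movement lookups per the data dictionary (levels 1-15)
decade = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60, 8: 70,
          9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
mainstream = {1, 3, 5, 8, 10, 12, 14}  # every other level is avantgarde

codes = pd.Series([14, 8, 5, 10])  # sample PRAEGENDE_JUGENDJAHRE values
generation = codes.map(decade)     # unmapped values become NaN
movement = codes.map(lambda v: 0 if v in mainstream else 1)

print(generation.tolist())  # [90, 70, 60, 80]
print(movement.tolist())    # [0, 0, 0, 0]
```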
mixed_cols = feat_info.query('type=="mixed"')['attribute'].to_list()
mixed_cols
['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'PRAEGENDE_JUGENDJAHRE', 'WOHNLAGE', 'CAMEO_INTL_2015', 'KBA05_BAUMAX', 'PLZ8_BAUMAX']
mixed_to_drop = []
for col in mixed_cols:
if (col != 'PRAEGENDE_JUGENDJAHRE') and (col != 'CAMEO_INTL_2015'):
mixed_to_drop.append(col)
mixed_to_drop
['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE', 'KBA05_BAUMAX', 'PLZ8_BAUMAX']
for col in mixed_to_drop:
try:
pop = pop.drop(columns=col)
except KeyError:
continue
pop.keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2',
'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT',
'ORTSGR_KLS9', 'RELAT_AB', 'wealth', 'life_stage', 'generation',
'movement'],
dtype='object')
pop.columns
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2',
'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT',
'ORTSGR_KLS9', 'RELAT_AB', 'wealth', 'life_stage', 'generation',
'movement'],
dtype='object')
pop.isna().sum().sum()
3031039
pop.select_dtypes(include='number').keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2',
'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT',
'ORTSGR_KLS9', 'RELAT_AB', 'wealth', 'life_stage', 'generation',
'movement'],
dtype='object')
Discussion 1.2.2: Engineer Mixed-Type Features¶
- From "PRAEGENDE_JUGENDJAHRE" I engineered two variables: "generation" (an ordinal decade variable) and "movement" (binary, mainstream vs. avantgarde).
- From "CAMEO_INTL_2015" I engineered "wealth" (from the tens digit) and "life_stage" (from the ones digit), both ordinal.
- The remaining mixed-type features (LP_LEBENSPHASE_FEIN, LP_LEBENSPHASE_GROB, WOHNLAGE, KBA05_BAUMAX, PLZ8_BAUMAX), I decided to drop.
- Note that the raw "PRAEGENDE_JUGENDJAHRE" and "CAMEO_INTL_2015" columns should also be removed before modeling, since the engineered variables now carry their information.
Step 1.2.3: Complete Feature Selection¶
In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:
- All numeric, interval, and ordinal type columns from the original dataset.
- Binary categorical features (all numerically-encoded).
- Engineered features from other multi-level categorical features and mixed features.
Make sure that for any new columns you have engineered, you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later in the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the engineered features derived from it should be retained. As a reminder, your data should only be from the subset with few or no missing values.
# If there are other re-engineering tasks you need to perform, make sure you
# take care of them here. (Dealing with missing data will come in step 2.1.)
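A lightweight way to verify the final column set is a set comparison between the columns that must be gone and the columns still present; a minimal sketch with illustrative names (a toy stand-in, not run against the real frame):

```python
# Raw columns that the engineered features replace (illustrative subset)
raw_cols = {'PRAEGENDE_JUGENDJAHRE', 'CAMEO_INTL_2015'}
# Toy stand-in for the cleaned frame's column set
kept = {'wealth', 'life_stage', 'generation', 'movement', 'ANREDE_KZ'}

leftovers = raw_cols & kept   # empty set means no raw mixed-type columns survive
print(sorted(leftovers))  # []
```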
pop.keys()
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE',
'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO',
'ONLINE_AFFINITAET', 'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2',
'PLZ8_ANTG3', 'PLZ8_ANTG4', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT',
'ORTSGR_KLS9', 'RELAT_AB', 'wealth', 'life_stage', 'generation',
'movement'],
dtype='object')
pop.shape
(891221, 85)
# Do whatever you need to in order to ensure that the dataframe only contains
# the columns that should be passed to the algorithm functions.
Step 1.3: Create a Cleaning Function¶
Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.
def clean_data(df:pd.DataFrame) -> pd.DataFrame:
"""
Perform feature trimming, re-encoding, and engineering for population data
INPUT: Sample DataFrame
OUTPUT: Trimmed and cleaned DataFrame (missing values are imputed later, in Step 2.1)
"""
# Put in code here to execute all main cleaning steps:
for x in range(len(feat_info)):
# using regex to extract the digits
missing_data = re.sub(r'[\[\]]', '', feat_info.iloc[x]['missing_or_unknown']).split(',')
# return the digit values for integer strings only
if missing_data != ['']:
missing_data = [np.int64(data) if (data!='X' and data!='XX') else data for data in missing_data]
df = df.replace({feat_info.iloc[x]['attribute']: missing_data}, np.nan)
df['CAMEO_INTL_2015'] = pd.to_numeric(df['CAMEO_INTL_2015'], downcast='integer', errors='coerce')
df['CAMEO_DEUG_2015'] = pd.to_numeric(df['CAMEO_DEUG_2015'], downcast='integer', errors='coerce')
df = df.drop(columns=['TITEL_KZ', 'CAMEO_DEU_2015','AGER_TYP','KK_KUNDENTYP','KBA05_BAUMAX','GEBURTSJAHR','ALTER_HH', 'KBA13_ANZAHL_PKW', 'KBA05_ANTG3'])
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].replace({'W':0, 'O':1})
df['OST_WEST_KZ'] = pd.to_numeric(df['OST_WEST_KZ'], errors='coerce') #setting the errors to Nans
df_outliers = identify_outlier_columns(df)
df = df.drop(columns=df_outliers)
# select, re-encode, and engineer column values.
multi_col = []
for col in cat_dimensions:
column_data = df.get(col)
if column_data is not None:
counte = column_data.value_counts().reset_index().iloc[:, 0].values
#print(f'{col} has: {counte} values')
if len(counte) > 2:
multi_col.append(col)
valid_columns = [col for col in multi_col if col in df.columns]
valid_columns.remove('CAMEO_INTL_2015')
valid_columns.remove('PRAEGENDE_JUGENDJAHRE')
df = df.drop(columns=valid_columns)
wealth = []
life_stage = []
for i in df['CAMEO_INTL_2015'].astype(float):
    if pd.isna(i):
        wealth.append(np.nan)      # keep missing values missing for later imputation
        life_stage.append(np.nan)
        continue
    if i//10 == 1.0:
        wealth.append(0)   # wealthy households
    elif i//10 == 2.0:
        wealth.append(1)   # prosperous households
    elif i//10 == 3.0:
        wealth.append(2)   # comfortable households
    elif i//10 == 4.0:
        wealth.append(3)   # less affluent households
    else:
        wealth.append(4)   # poorer households
    if i%10 == 1.0:
        life_stage.append(0)   # pre-family couples & singles
    elif i%10 == 2.0:
        life_stage.append(1)   # young couples with children
    elif i%10 == 3.0:
        life_stage.append(2)   # families with school-age children
    elif i%10 == 4.0:
        life_stage.append(3)   # older families & mature couples
    else:
        life_stage.append(4)   # elders in retirement
df['wealth'] = wealth          # wealth from the tens digit (floor division)
df['life_stage'] = life_stage  # life stage from the ones digit (remainder)
generation = []
movement = []
for i in df['PRAEGENDE_JUGENDJAHRE'].astype(float):
    if pd.isna(i):
        generation.append(np.nan)  # keep missing values missing for later imputation
        movement.append(np.nan)
        continue
    if i in (1, 2):
        generation.append(0)   # 40s
    elif i in (3, 4):
        generation.append(1)   # 50s
    elif i in (5, 6, 7):
        generation.append(2)   # 60s
    elif i in (8, 9):
        generation.append(3)   # 70s
    elif i in (10, 11, 12, 13):
        generation.append(4)   # 80s
    else:
        generation.append(5)   # 90s (levels 14 and 15)
    if i in (1, 3, 5, 8, 10, 12, 14):
        movement.append(0)     # mainstream movement
    else:
        movement.append(1)     # avantgarde movement
df['generation'] = generation  # encoding generation (ordinal by decade)
df['movement'] = movement      # encoding movement type (binary)
mixed_to_drop = []
for col in mixed_cols:
if (col != 'PRAEGENDE_JUGENDJAHRE') and (col != 'CAMEO_INTL_2015'):
mixed_to_drop.append(col)
for col in mixed_to_drop:
try:
df = df.drop(columns=col)
except KeyError:
continue
return df # Returning our cleaned dataframe.
#re-reading the raw data to ensure clean_data runs on an unmodified frame
pop = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')
df = clean_data(pop)
df.head()
| | ALTERSKATEGORIE_GROB | ANREDE_KZ | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | HEALTH_TYP | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | VERS_TYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | KONSUMNAEHE | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | wealth | life_stage | generation | movement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1 | 3 | 4 | 3 | 5 | 5 | 3 | NaN | NaN | 5.0 | 2 | 6 | 7 | 5 | 1 | 5 | 3 | 3 | 4 | 7 | 6 | 6 | 5 | 3 | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 4 | 5 | 1 |
| 1 | 1.0 | 2 | 1 | 5 | 2 | 5 | 4 | 5 | 3.0 | 14.0 | 1.0 | 5 | 4 | 4 | 3 | 1 | 2 | 2 | 3 | 6 | 4 | 7 | 4 | 7 | 6 | 2.0 | 6.0 | 3.0 | 1.0 | 51.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 8.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 2.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 | 4 | 0 | 4 | 0 |
| 2 | 3.0 | 2 | 1 | 4 | 1 | 2 | 3 | 5 | 3.0 | 15.0 | 3.0 | 4 | 1 | 3 | 3 | 4 | 4 | 6 | 3 | 4 | 7 | 7 | 7 | 3 | 3 | 1.0 | 4.0 | 3.0 | 5.0 | 24.0 | 1.0 | 3.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 2.0 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 1.0 | 0.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 | 1 | 3 | 4 | 1 |
| 3 | 4.0 | 2 | 4 | 2 | 5 | 2 | 1 | 2 | 2.0 | 8.0 | 2.0 | 5 | 1 | 2 | 1 | 4 | 4 | 7 | 4 | 3 | 4 | 4 | 5 | 4 | 4 | 1.0 | 1.0 | NaN | 4.0 | 12.0 | 4.0 | 1.0 | 4.0 | 4.0 | 2.0 | 6.0 | 4.0 | NaN | 4.0 | 1.0 | NaN | 2.0 | 2.0 | 2.0 | 0.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 | 0 | 1 | 3 | 0 |
| 4 | 3.0 | 1 | 4 | 3 | 4 | 1 | 3 | 2 | 3.0 | 8.0 | 5.0 | 6 | 4 | 4 | 2 | 7 | 4 | 4 | 6 | 2 | 3 | 2 | 2 | 4 | 2 | 2.0 | 5.0 | 2.0 | 4.0 | 43.0 | 1.0 | 4.0 | 3.0 | 2.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 5.0 | 2.0 | 4.0 | 2.0 | 1.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 | 3 | 2 | 3 | 0 |
df.shape
(891221, 54)
df.isna().sum().sum() #still need to sort nans
2897715
Confirming that the cleaning function works as intended¶
assert pop.shape[0] == df.shape[0], "clean_data function is not working properly, rows mismatch"
print("If this is all you see, you passed the test")
If this is all you see, you passed the test
Step 2: Feature Transformation¶
Step 2.1: Apply Feature Scaling¶
Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:
- sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to your data, make sure that you've cleaned the DataFrame of the remaining missing values. This can be as simple as removing all data points with missing data, or applying a SimpleImputer to replace all missing values. You might also try a more complicated procedure where you temporarily remove missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation. Think about how much missing data you have and what possible effects each approach might have on your analysis, and justify your decision in the discussion section below.
- For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.
- For these classes, you can make use of the .fit_transform() method to both fit a procedure to the data and apply the transformation at the same time. Don't forget to keep the fitted sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
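A minimal sketch of the impute-then-scale flow on a toy array (the values are illustrative; keeping the fitted objects around is what lets you reuse them on the customer data later):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 8.0]])

imputer = SimpleImputer(strategy='median')  # median is robust to outliers
scaler = StandardScaler()                   # mean 0, std 1 per feature

X_imputed = imputer.fit_transform(X)   # NaNs replaced by column medians
X_scaled = scaler.fit_transform(X_imputed)

print(np.isnan(X_scaled).any())  # False
```

Later, the customer data would go through `imputer.transform` and `scaler.transform` (not `fit_transform`), so it is standardized with the population's parameters.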
imp_median = SimpleImputer(strategy='median') #median is less sensitive to outliers compared to mean
numer_cols = df.select_dtypes(include='number').keys()
numer_cols
Index(['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'FINANZ_MINIMALIST',
'FINANZ_SPARER', 'FINANZ_VORSORGER', 'FINANZ_ANLEGER',
'FINANZ_UNAUFFAELLIGER', 'FINANZ_HAUSBAUER', 'HEALTH_TYP',
'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM',
'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL',
'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM',
'SEMIO_PFLICHT', 'SEMIO_TRADV', 'VERS_TYP', 'HH_EINKOMMEN_SCORE',
'W_KEIT_KIND_HH', 'KONSUMNAEHE', 'CAMEO_INTL_2015', 'KBA05_ANTG1',
'KBA05_ANTG2', 'KBA05_GBZ', 'BALLRAUM', 'EWDICHTE', 'INNENSTADT',
'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO', 'ONLINE_AFFINITAET',
'REGIOTYP', 'PLZ8_ANTG1', 'PLZ8_ANTG2', 'PLZ8_ANTG3', 'PLZ8_ANTG4',
'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9', 'RELAT_AB', 'wealth',
'life_stage', 'generation', 'movement'],
dtype='object')
df[numer_cols] = imp_median.fit_transform(df[numer_cols])
df.isna().sum().sum() #verifying simple imputer
0
Sample Scaling
scaler = StandardScaler()
# Apply feature scaling to the general population data.
scaled_data = scaler.fit_transform(df)
scaled_data
array([[-0.75167967, -1.045218 , -0.05641562, ..., 1.21814982,
1.10233491, 1.46453679],
[-1.74333933, 0.95673821, -1.57035806, ..., -1.36014202,
0.39299478, -0.68280975],
[ 0.23997999, 0.95673821, -1.57035806, ..., 0.57357686,
0.39299478, 1.46453679],
...,
[-0.75167967, 0.95673821, -0.81338684, ..., 0.57357686,
0.39299478, -0.68280975],
[-1.74333933, -1.045218 , -1.57035806, ..., -1.36014202,
0.39299478, -0.68280975],
[ 1.23163965, -1.045218 , 0.7005556 , ..., -0.0709961 ,
-1.7350256 , -0.68280975]])
type(scaled_data)
numpy.ndarray
Discussion 2.1: Apply Feature Scaling¶
The imputed and scaled data comes back as a NumPy array, as expected from fit_transform. I imputed missing values with the column median, since the median is less sensitive to outliers than the mean, and then applied StandardScaler so that every feature contributes on a comparable scale to the PCA that follows. Both fitted objects (the imputer and the scaler) are kept so they can be reused on the customer data later.
Step 2.2: Perform Dimensionality Reduction¶
On your scaled data, you are now ready to apply dimensionality reduction techniques.
- Use sklearn's PCA class to apply principal component analysis on the data, thus finding the vectors of maximal variance in the data. To start, you should not set any parameters (so all components are computed) or set a number of components that is at least half the number of features (so there are enough components to see the general trend in variability).
- Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
- Once you've made a choice for the number of components to keep, make sure you re-fit a PCA instance to perform the decided-on transformation.
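Choosing the component count from the cumulative explained variance can be sketched with NumPy alone (the per-component ratios below are illustrative, not the project's actual values):

```python
import numpy as np

# Illustrative explained-variance ratios, one per principal component
ratios = np.array([0.50, 0.20, 0.15, 0.10, 0.05])

cumulative = np.cumsum(ratios)  # [0.5, 0.7, 0.85, 0.95, 1.0]
threshold = 0.90
# Index of the first component at which the cumulative variance reaches the threshold
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)  # 4
```

The same logic underlies the loop below that stops once 95% of the variability is explained.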
# We aim to apply PCA to the data, but first define a function to handle the steps for us
df.isna().sum().sum()
0
def apply_pca(x:int=None, df:pd.DataFrame=df):
"""
INPUT:
x: str denoting number of components to be broken down into
df: scaled dataframe to find principal components
OUTPUT:
df_pca: (np.array) denoting result
pca: object denoting PCA's instantiation
"""
scaler = StandardScaler() #including scaling in function for ease
scaled_data = scaler.fit_transform(df)
pca = PCA(n_components=x)
df_pca = pca.fit_transform(scaled_data)
return pca, df_pca
pca, df_pca = apply_pca() #base PCA with all components retained
pca
PCA()
def scree_plot(pca):
'''
Creates a scree plot associated with the principal components
INPUT: pca - a fitted scikit-learn PCA instance
OUTPUT:
None
'''
num_components=len(pca.explained_variance_ratio_)
ind = np.arange(num_components)
vals = pca.explained_variance_ratio_
plt.figure(figsize=(15, 11))
ax = plt.subplot(111)
cumvals = np.cumsum(vals)
ax.bar(ind, vals)
ax.plot(ind, cumvals)
for i in range(num_components):
ax.annotate("%.1f%%" % (vals[i]*100), (ind[i]+0.2, vals[i]), va="bottom", ha="center", fontsize=12)
ax.xaxis.set_tick_params(width=0)
ax.yaxis.set_tick_params(width=2, length=12)
ax.set_xlabel("Principal Component")
ax.set_ylabel("Variance-Explained Ratio")
plt.title('Explained Variance Per Principal Component')
scree_plot(pca)
def pca_results(full_dataset:pd.DataFrame, pca):
'''
Create a DataFrame of the PCA results
Includes dimension feature weights and explained variance
Visualizes the PCA results
'''
# Dimension indexing
dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]
# PCA components as a dataframe
components = pd.DataFrame(np.round(pca.components_, 4), columns = full_dataset.keys())
components.index = dimensions
# PCA explained variance as another dataframe
ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
variance_ratios.index = dimensions
# Create a bar plot visualization
fig, ax = plt.subplots(figsize = (15,10))
# Plot the feature weights as a function of the components
components.plot(ax = ax, kind = 'bar');
ax.set_ylabel("Feature Weights")
ax.set_xticklabels(dimensions, rotation=0)
# Display the explained variance ratios
for i, ev in enumerate(pca.explained_variance_ratio_):
ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev))
# Return a concatenated DataFrame
return pd.concat([variance_ratios, components], axis = 1)
def pca_computer(x:int=None, df:pd.DataFrame=df):
    """
    Combination of the apply_pca and pca_results functions
    INPUT:
    x (int): number of principal components to keep (None keeps all)
    df (pd.DataFrame): DataFrame to compute PCA on
    OUTPUT:
    zing (pd.DataFrame): explained variance and component weights per dimension
    """
    pca, df_pca = apply_pca(x, df) # forward the arguments rather than refitting with the defaults
    zing = pca_results(df, pca)
    return zing
#loop to investigate variability threshold
for comp in range(3, df.shape[1]):
    pca, X_pca = apply_pca(comp, df)
    comp_check = pca_results(df, pca)
    plt.close('all') # close the figures opened by pca_results to limit memory use
    if comp_check['Explained Variance'].sum() > 0.95: # Re-apply PCA to the data while selecting for number of components to retain.
        break
num_comps = comp_check.shape[0]
print("Using {} components, we can explain {:.2%} of the variability in the original data.".format(num_comps, comp_check['Explained Variance'].sum()))
/var/folders/y1/2cln1s193dgc6xy3yzry761r0000gn/T/ipykernel_845/3037338320.py:21: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`. fig, ax = plt.subplots(figsize = (15,10))
Using 36 components, we can explain 95.03% of the variability in the original data.
#Manually confirming this:
pca, df_pca = apply_pca(36, df)
pca
PCA(n_components=36)
pca_results(df, pca) # Investigating the variance accounted for by each principal component.
| | Explained Variance | ALTERSKATEGORIE_GROB | ANREDE_KZ | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | HEALTH_TYP | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | VERS_TYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | KONSUMNAEHE | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | wealth | life_stage | generation | movement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension 1 | 0.1868 | 0.1657 | -0.0068 | 0.2295 | -0.1946 | 0.1604 | -0.1163 | -0.1276 | -0.1310 | -0.0368 | -0.1453 | 0.0501 | -0.0333 | -0.1061 | -0.1553 | -0.1093 | 0.0563 | 0.0989 | 0.1146 | -0.0891 | -0.1315 | -0.0124 | -0.0359 | -0.0465 | -0.1599 | -0.1235 | -0.0394 | -0.1494 | -0.0046 | 0.1607 | -0.1982 | 0.2083 | -0.0174 | 0.1961 | 0.1160 | -0.1890 | 0.1542 | 0.1050 | -0.0387 | 0.2211 | 0.0126 | -0.0665 | 0.2175 | -0.1559 | -0.2205 | -0.2102 | -0.0530 | 0.1544 | -0.1332 | -0.1877 | -0.1297 | -0.1999 | 0.0952 | -0.1353 | 0.0438 |
| Dimension 2 | 0.1565 | -0.2288 | -0.0522 | -0.0296 | 0.2119 | -0.2160 | 0.2056 | 0.2107 | -0.1489 | 0.0504 | 0.2384 | -0.1630 | 0.0471 | 0.1445 | 0.2130 | 0.1391 | 0.0168 | -0.1590 | -0.1931 | 0.1615 | 0.1631 | -0.0604 | -0.0321 | -0.0690 | 0.2043 | 0.2029 | -0.0221 | -0.1113 | -0.1339 | 0.1107 | -0.1248 | 0.1196 | -0.0065 | 0.1296 | 0.0970 | -0.1397 | 0.1179 | 0.0747 | -0.0061 | 0.1417 | 0.1650 | -0.0310 | 0.1412 | -0.1037 | -0.1415 | -0.1319 | -0.0207 | 0.1008 | -0.1059 | -0.1404 | -0.1015 | -0.1039 | 0.0483 | 0.2014 | 0.0452 |
| Dimension 3 | 0.1102 | -0.0264 | 0.3586 | -0.1161 | 0.0702 | -0.0532 | 0.1660 | 0.0923 | 0.0422 | 0.0019 | 0.0434 | -0.0721 | -0.2941 | -0.2405 | -0.0665 | -0.0767 | -0.3534 | -0.0032 | 0.1879 | -0.2920 | 0.1667 | 0.3105 | 0.3339 | 0.3603 | 0.0344 | 0.0073 | 0.0428 | -0.0469 | -0.0702 | 0.0301 | -0.0184 | 0.0155 | -0.0053 | 0.0207 | 0.0378 | -0.0307 | 0.0395 | 0.0364 | 0.0110 | 0.0318 | 0.0160 | 0.0052 | 0.0319 | -0.0184 | -0.0239 | -0.0236 | 0.0109 | 0.0214 | -0.0345 | -0.0375 | -0.0271 | 0.0056 | 0.0654 | 0.0843 | 0.0480 |
| Dimension 4 | 0.0505 | -0.0013 | -0.0399 | 0.1386 | -0.0205 | 0.0093 | -0.1181 | 0.1021 | -0.1463 | -0.0104 | 0.0228 | 0.0578 | -0.0115 | 0.0564 | 0.0470 | 0.0410 | -0.0143 | 0.0710 | -0.0718 | -0.0427 | -0.0221 | 0.0370 | 0.0830 | 0.0437 | -0.0150 | -0.0086 | 0.0777 | -0.3376 | -0.0598 | -0.1469 | -0.1833 | 0.1123 | -0.0345 | 0.0768 | -0.2395 | 0.2645 | -0.2268 | -0.0830 | -0.3258 | 0.0988 | 0.0963 | -0.2598 | -0.0439 | 0.0913 | 0.0987 | 0.0881 | 0.1620 | 0.0360 | 0.0303 | 0.2396 | 0.0674 | -0.1461 | 0.1560 | 0.0721 | 0.3908 |
| Dimension 5 | 0.0412 | 0.0809 | -0.1529 | 0.1052 | 0.0498 | 0.0469 | 0.1670 | 0.1747 | -0.0482 | -0.0912 | -0.1143 | 0.0827 | -0.1076 | 0.2167 | 0.2000 | 0.0478 | -0.1048 | 0.1984 | -0.1269 | -0.0363 | -0.1865 | 0.1314 | -0.0556 | -0.0233 | -0.0253 | -0.2390 | 0.2252 | -0.2022 | 0.0998 | 0.0440 | 0.1648 | -0.0609 | -0.0501 | -0.0239 | 0.1991 | -0.0847 | 0.1460 | 0.1178 | 0.1912 | 0.0031 | -0.1688 | 0.1846 | -0.0551 | 0.0587 | 0.0761 | 0.0644 | 0.0655 | -0.0501 | -0.0484 | -0.1188 | -0.0269 | 0.2616 | 0.2470 | 0.1202 | 0.2837 |
| Dimension 6 | 0.0343 | 0.0130 | -0.0313 | 0.1326 | 0.0031 | -0.0453 | 0.0270 | -0.0206 | -0.2173 | 0.1514 | 0.0779 | -0.0151 | 0.0335 | -0.0632 | -0.0456 | -0.1917 | 0.0283 | -0.0687 | 0.0470 | 0.0583 | -0.0693 | 0.0017 | 0.0137 | 0.0595 | 0.0011 | -0.0916 | 0.0939 | -0.0490 | -0.1577 | 0.1331 | -0.0249 | 0.1532 | -0.1381 | 0.1059 | -0.1502 | 0.1515 | -0.1692 | 0.1518 | 0.4030 | 0.1355 | 0.1604 | 0.3764 | 0.0262 | -0.0825 | -0.0268 | -0.0014 | -0.2370 | -0.1901 | 0.2724 | 0.2077 | 0.2296 | -0.0277 | -0.0116 | 0.0674 | 0.0006 |
| Dimension 7 | 0.0281 | -0.1349 | 0.0892 | 0.0929 | -0.1620 | 0.1609 | -0.1080 | -0.0949 | 0.0868 | -0.2773 | -0.2354 | 0.1153 | -0.0950 | 0.0690 | 0.2015 | 0.3237 | -0.0774 | -0.0125 | -0.0760 | 0.0896 | 0.2884 | 0.0253 | 0.0369 | 0.0872 | 0.2106 | 0.2407 | -0.2502 | -0.0436 | 0.2097 | 0.0885 | 0.0011 | 0.0869 | 0.0603 | 0.1104 | -0.0232 | 0.0616 | -0.0453 | 0.1612 | 0.0831 | 0.1064 | -0.1696 | 0.1051 | 0.0403 | 0.0157 | -0.0302 | -0.0731 | -0.0940 | -0.0261 | 0.1894 | 0.0680 | 0.1637 | 0.0089 | 0.1287 | -0.1816 | 0.0213 |
| Dimension 8 | 0.0270 | -0.0928 | -0.0245 | -0.0226 | -0.0139 | 0.0870 | -0.0392 | -0.0128 | 0.0143 | 0.2703 | -0.0409 | 0.0926 | 0.0301 | 0.0046 | 0.0312 | -0.1183 | -0.0290 | -0.0778 | -0.0366 | 0.0750 | 0.0148 | 0.0476 | 0.0455 | 0.0198 | 0.0900 | 0.0149 | 0.2431 | 0.0581 | 0.1289 | -0.0760 | -0.0370 | 0.0448 | 0.2253 | 0.0324 | -0.0683 | 0.0950 | -0.0501 | 0.0730 | 0.1849 | 0.0233 | -0.1059 | 0.2290 | 0.1241 | 0.0955 | -0.0310 | -0.1211 | 0.5253 | 0.5097 | -0.0092 | 0.0662 | 0.0018 | -0.0473 | -0.0685 | -0.0477 | -0.0699 |
| Dimension 9 | 0.0251 | -0.1564 | -0.0239 | -0.0391 | -0.0318 | 0.0902 | -0.0259 | -0.0627 | 0.1277 | 0.2989 | -0.1028 | 0.1635 | 0.0495 | 0.0598 | 0.1142 | -0.1082 | -0.1013 | -0.0789 | -0.1171 | 0.0457 | 0.0482 | 0.0945 | 0.0594 | 0.0068 | 0.1140 | 0.0644 | 0.3415 | -0.0389 | 0.2427 | 0.0235 | -0.1127 | -0.0160 | -0.2990 | -0.1070 | -0.1475 | 0.0147 | -0.1086 | -0.1469 | -0.0358 | -0.0724 | -0.2071 | -0.0219 | 0.0904 | -0.2772 | -0.1257 | 0.0317 | -0.2478 | -0.2032 | -0.2015 | 0.0073 | -0.2429 | -0.1057 | 0.0523 | -0.0795 | -0.0267 |
| Dimension 10 | 0.0242 | -0.0721 | -0.0439 | 0.0051 | -0.0006 | 0.0037 | -0.0245 | -0.0518 | 0.0405 | 0.3397 | -0.0012 | 0.0024 | 0.0238 | 0.0184 | -0.0131 | -0.0822 | -0.0401 | -0.0995 | -0.0719 | -0.0079 | -0.0395 | 0.0418 | 0.1013 | -0.0531 | -0.0059 | 0.0129 | 0.3056 | 0.0569 | 0.0573 | 0.0653 | 0.0959 | 0.0272 | 0.3496 | 0.1612 | 0.1920 | -0.0612 | 0.1423 | 0.1156 | -0.2434 | 0.0904 | -0.1018 | -0.3077 | -0.0344 | 0.1701 | -0.0004 | -0.1273 | -0.2598 | -0.0998 | 0.2817 | -0.0483 | 0.3314 | 0.0796 | 0.0075 | -0.0198 | 0.0147 |
| Dimension 11 | 0.0227 | 0.0501 | 0.0178 | -0.3120 | 0.1885 | -0.1162 | 0.1338 | 0.1671 | 0.3693 | -0.2981 | 0.0973 | -0.2022 | 0.0625 | -0.0134 | 0.0196 | -0.1576 | -0.0074 | 0.2382 | 0.0208 | -0.0397 | -0.1306 | -0.1006 | -0.1486 | -0.0586 | -0.0496 | -0.1180 | 0.0793 | 0.0551 | 0.2985 | 0.0660 | -0.1242 | 0.0408 | 0.0965 | 0.0154 | -0.1357 | 0.0696 | -0.0984 | 0.0555 | -0.0267 | 0.0114 | -0.0622 | 0.0010 | 0.1620 | -0.1303 | -0.1509 | -0.1547 | -0.0828 | 0.0711 | 0.1741 | 0.1131 | 0.1098 | -0.1457 | 0.0843 | -0.0657 | 0.1111 |
| Dimension 12 | 0.0199 | 0.0448 | -0.0040 | 0.0113 | 0.0268 | -0.0111 | -0.0101 | 0.0683 | -0.0608 | -0.0213 | 0.0164 | 0.0543 | -0.0462 | 0.0044 | 0.0141 | 0.0651 | -0.0061 | 0.1476 | 0.0406 | 0.0188 | 0.0062 | 0.0331 | -0.0169 | 0.0025 | 0.0364 | -0.0307 | 0.0125 | 0.0023 | 0.0547 | -0.1729 | -0.0424 | 0.0365 | 0.5722 | 0.1873 | -0.1715 | 0.0503 | -0.1199 | -0.1604 | 0.0977 | 0.1154 | -0.0043 | 0.1985 | -0.0339 | 0.2503 | 0.0123 | -0.1031 | -0.3063 | -0.1864 | -0.3271 | -0.0189 | -0.2877 | -0.0342 | -0.1448 | 0.0480 | 0.0108 |
| Dimension 13 | 0.0179 | 0.0220 | -0.0455 | -0.0493 | -0.0023 | 0.0464 | -0.0299 | 0.0157 | -0.0041 | 0.0114 | -0.0101 | 0.1670 | -0.0278 | 0.0210 | -0.0226 | 0.0612 | 0.0415 | -0.1253 | -0.0231 | -0.0228 | -0.0372 | 0.0083 | 0.0039 | 0.0151 | -0.0561 | -0.0304 | -0.0160 | 0.0459 | -0.2407 | 0.3070 | -0.1223 | -0.3035 | 0.2646 | -0.3014 | -0.1625 | 0.0741 | -0.1095 | 0.5003 | -0.0124 | -0.2345 | 0.0356 | -0.0751 | 0.0575 | 0.0354 | -0.0077 | -0.0636 | -0.0610 | -0.0163 | -0.0874 | 0.0376 | -0.1306 | -0.1197 | 0.3114 | 0.0596 | -0.0403 |
| Dimension 14 | 0.0173 | 0.1426 | -0.0451 | 0.0098 | 0.0684 | -0.1130 | -0.0115 | 0.0210 | -0.1741 | 0.0360 | 0.0752 | 0.3736 | -0.0762 | 0.0500 | -0.0005 | 0.2211 | 0.0118 | 0.1608 | 0.1022 | 0.0041 | -0.0017 | 0.0474 | 0.0343 | 0.0150 | -0.0033 | -0.0417 | -0.0901 | 0.1194 | -0.1335 | 0.0092 | 0.1732 | -0.0479 | 0.0053 | -0.0417 | -0.0971 | 0.0418 | -0.0978 | -0.1235 | -0.0891 | -0.0694 | -0.2784 | -0.0748 | 0.2400 | -0.2787 | -0.2757 | -0.2033 | -0.0599 | 0.1358 | 0.1060 | 0.0454 | 0.1036 | 0.1741 | -0.2588 | 0.2701 | 0.0928 |
| Dimension 15 | 0.0149 | -0.0771 | 0.0276 | 0.0035 | -0.0758 | 0.2070 | -0.0693 | -0.0942 | 0.1716 | -0.1809 | -0.1029 | -0.1433 | 0.0014 | 0.0307 | 0.1042 | -0.0752 | -0.0806 | 0.0803 | -0.0688 | -0.0061 | 0.0330 | -0.0963 | -0.0065 | -0.0005 | 0.0905 | 0.0264 | 0.2847 | -0.0069 | -0.2821 | 0.0597 | 0.3561 | 0.0565 | -0.0056 | 0.0353 | -0.1441 | 0.0480 | -0.0771 | 0.1422 | -0.0647 | 0.0283 | 0.3863 | -0.0986 | 0.1096 | -0.0853 | -0.1041 | -0.0881 | -0.0198 | 0.0582 | -0.0998 | 0.0423 | -0.1413 | 0.3516 | -0.2286 | -0.1271 | 0.0959 |
| Dimension 16 | 0.0143 | 0.0138 | 0.0221 | -0.0171 | 0.0819 | -0.1219 | 0.0063 | 0.0872 | -0.0916 | 0.0567 | 0.0936 | -0.0021 | 0.0703 | -0.0568 | -0.0769 | -0.0495 | 0.0219 | -0.0616 | 0.0210 | -0.0225 | -0.0239 | 0.0069 | 0.0072 | -0.0088 | -0.0550 | -0.0060 | -0.1558 | -0.0116 | 0.3251 | 0.2163 | 0.1419 | 0.1778 | -0.1983 | 0.1367 | -0.1194 | 0.0408 | -0.1283 | 0.5094 | -0.1243 | 0.1339 | -0.2410 | -0.1004 | -0.1039 | 0.1448 | 0.1184 | 0.0776 | 0.0442 | -0.0232 | -0.1381 | 0.0079 | -0.2045 | 0.1427 | -0.3283 | 0.0129 | 0.0155 |
| Dimension 17 | 0.0132 | -0.0513 | 0.0724 | -0.1327 | 0.0444 | 0.0227 | 0.0857 | -0.0353 | 0.1319 | -0.2714 | 0.0213 | 0.5456 | 0.0844 | -0.0428 | 0.0738 | -0.1402 | -0.0260 | 0.0862 | 0.0723 | 0.0274 | 0.0690 | -0.1567 | -0.0839 | 0.0109 | 0.1151 | 0.0292 | 0.2423 | 0.0742 | -0.2950 | 0.0297 | -0.2095 | 0.1166 | -0.1142 | 0.0546 | 0.1951 | -0.0073 | 0.0356 | 0.0845 | 0.0432 | 0.0630 | -0.0818 | -0.0341 | -0.1135 | 0.1783 | 0.1249 | 0.0853 | 0.0100 | -0.0381 | -0.0312 | -0.0878 | 0.0692 | -0.2188 | -0.2321 | 0.0457 | 0.0569 |
| Dimension 18 | 0.0128 | 0.1220 | -0.0376 | 0.0012 | -0.0203 | 0.0020 | -0.0649 | -0.0815 | 0.0760 | 0.2913 | -0.0188 | -0.1724 | 0.0093 | 0.1033 | 0.0271 | 0.3524 | -0.0068 | 0.5576 | 0.1496 | 0.0511 | 0.0757 | 0.0708 | 0.0926 | 0.0015 | 0.0487 | 0.0846 | 0.1185 | -0.0544 | 0.0831 | 0.1586 | -0.1931 | -0.0804 | -0.0482 | -0.0989 | 0.0719 | -0.0479 | 0.0612 | 0.1565 | -0.0182 | -0.0875 | 0.2291 | 0.0893 | -0.0686 | -0.0026 | 0.0566 | 0.0748 | 0.0243 | -0.0185 | 0.0640 | -0.0139 | 0.0302 | -0.1875 | -0.2939 | 0.0288 | 0.0158 |
| Dimension 19 | 0.0115 | 0.0630 | -0.0366 | -0.0707 | 0.0647 | -0.1692 | 0.0992 | -0.0731 | 0.0998 | 0.1374 | -0.0165 | 0.1091 | 0.0145 | 0.0321 | -0.0165 | 0.1862 | 0.0039 | 0.2871 | 0.0808 | 0.0297 | 0.0003 | -0.0186 | 0.0176 | -0.0211 | -0.0192 | 0.0113 | 0.0781 | 0.2094 | -0.1864 | -0.0126 | 0.1303 | 0.1879 | -0.0923 | 0.2802 | -0.1437 | -0.0245 | -0.1561 | -0.0046 | -0.0227 | 0.2086 | -0.0169 | -0.0915 | -0.1194 | -0.0176 | 0.0617 | 0.0495 | 0.0969 | 0.0192 | 0.1029 | -0.0319 | -0.1996 | 0.0798 | 0.4844 | -0.0583 | -0.3522 |
| Dimension 20 | 0.0106 | 0.0316 | -0.0486 | 0.1303 | 0.0091 | -0.1137 | -0.1323 | 0.1284 | -0.3322 | -0.4147 | 0.0159 | -0.1673 | -0.1159 | 0.0035 | -0.0557 | 0.1537 | 0.0748 | -0.0468 | 0.0366 | -0.0450 | 0.0527 | 0.1790 | -0.0644 | -0.0117 | -0.1139 | 0.1457 | 0.5280 | 0.0596 | 0.0727 | -0.0173 | -0.0549 | -0.0397 | -0.1086 | -0.0479 | -0.0749 | 0.0361 | 0.0157 | 0.0163 | -0.0359 | -0.0396 | -0.0982 | -0.0320 | 0.0137 | 0.0989 | -0.0264 | -0.0637 | -0.0236 | 0.0224 | -0.0372 | -0.0015 | 0.1014 | -0.0432 | -0.0436 | -0.0301 | -0.3711 |
| Dimension 21 | 0.0101 | -0.0132 | 0.0482 | -0.0466 | -0.0118 | 0.0490 | -0.1228 | 0.0997 | -0.0609 | 0.0917 | 0.0449 | 0.0158 | 0.2853 | -0.0643 | -0.0316 | -0.0425 | 0.0430 | 0.1893 | 0.0921 | 0.0287 | 0.0554 | -0.0410 | 0.0661 | 0.0661 | 0.1033 | 0.0405 | -0.0681 | 0.1150 | 0.0446 | -0.3308 | 0.1049 | 0.0910 | -0.3062 | -0.1780 | 0.0409 | 0.0828 | 0.1115 | 0.1443 | -0.0452 | -0.0986 | 0.0657 | 0.0781 | 0.1865 | 0.4081 | -0.0761 | -0.3004 | -0.2055 | 0.0707 | -0.1655 | -0.0308 | 0.1056 | 0.0493 | 0.2475 | 0.0496 | 0.0639 |
| Dimension 22 | 0.0092 | -0.0180 | -0.0615 | -0.0807 | 0.0991 | -0.0485 | 0.0611 | 0.2360 | 0.0468 | 0.0888 | -0.0440 | 0.4282 | -0.2997 | 0.0224 | -0.0099 | 0.0589 | -0.0210 | -0.0650 | -0.1413 | -0.0461 | -0.0583 | 0.1015 | -0.0815 | -0.0490 | -0.1547 | -0.0643 | -0.1021 | -0.1202 | 0.2391 | -0.0646 | 0.0054 | -0.0288 | -0.0937 | -0.0023 | 0.0958 | 0.0381 | -0.1465 | 0.0079 | -0.0751 | 0.0188 | 0.5184 | 0.0259 | 0.0380 | 0.0601 | -0.0451 | -0.1003 | -0.0408 | 0.0258 | 0.0038 | -0.0855 | 0.1162 | 0.0345 | -0.0807 | -0.2151 | -0.2306 |
| Dimension 23 | 0.0091 | -0.0744 | 0.1077 | 0.1088 | -0.0379 | -0.1414 | -0.0385 | -0.0832 | -0.1755 | -0.1512 | 0.1331 | 0.2489 | 0.4365 | -0.1499 | -0.0945 | -0.0876 | 0.0729 | 0.0803 | 0.1283 | -0.0278 | 0.1095 | 0.0308 | -0.0008 | 0.0922 | 0.0682 | 0.1359 | 0.1201 | -0.0393 | 0.3651 | 0.1861 | 0.1546 | -0.0884 | 0.2030 | -0.0250 | 0.1513 | -0.0207 | -0.0685 | -0.0916 | -0.0275 | -0.0459 | 0.2865 | 0.0260 | -0.0486 | -0.1863 | 0.0098 | 0.0975 | 0.0682 | -0.0044 | 0.0788 | -0.0553 | -0.1005 | 0.0979 | 0.1712 | 0.0914 | 0.0856 |
| Dimension 24 | 0.0087 | -0.0177 | 0.0145 | -0.0077 | -0.0086 | 0.0428 | 0.0091 | -0.0327 | 0.0004 | 0.0157 | 0.0067 | 0.0178 | -0.0487 | -0.0158 | 0.0251 | -0.0455 | -0.0298 | 0.0001 | -0.0136 | -0.0027 | -0.0065 | 0.0047 | -0.0410 | -0.0417 | 0.0551 | -0.0234 | -0.0323 | -0.0044 | 0.0265 | 0.6116 | 0.0304 | -0.0568 | -0.1622 | -0.0231 | -0.3029 | -0.1338 | 0.0371 | -0.4271 | -0.0167 | -0.0150 | 0.0279 | 0.0237 | -0.0898 | 0.4421 | 0.0160 | -0.2249 | 0.0004 | 0.1057 | 0.0744 | -0.0536 | -0.0047 | 0.0165 | 0.0161 | 0.0491 | 0.0583 |
| Dimension 25 | 0.0080 | 0.0508 | 0.0217 | -0.0784 | 0.0586 | -0.0331 | 0.0309 | 0.2325 | 0.0851 | -0.0172 | -0.1972 | 0.0044 | 0.6089 | 0.1538 | 0.0269 | 0.1455 | -0.0024 | -0.1053 | 0.0204 | -0.0794 | -0.1076 | 0.4226 | 0.0663 | 0.1053 | -0.0284 | -0.1095 | -0.0927 | -0.1236 | -0.2202 | 0.1140 | -0.0364 | -0.0135 | 0.0287 | 0.0814 | -0.0290 | 0.0527 | 0.0537 | -0.0829 | 0.0356 | 0.0498 | -0.0401 | -0.0236 | 0.0488 | -0.0150 | -0.0170 | 0.0166 | 0.0083 | -0.0050 | -0.0883 | 0.0188 | 0.1156 | 0.0151 | -0.1184 | -0.2595 | -0.1109 |
| Dimension 26 | 0.0077 | 0.1404 | -0.0590 | -0.1254 | -0.0307 | 0.0115 | -0.0318 | 0.0176 | 0.2612 | 0.0518 | 0.0873 | -0.0159 | -0.0111 | -0.0132 | -0.1890 | 0.3954 | 0.1616 | -0.3369 | 0.1465 | -0.0101 | 0.0825 | -0.0793 | 0.0538 | 0.0174 | -0.3369 | 0.1488 | 0.1986 | 0.0229 | 0.0079 | 0.0778 | 0.0445 | 0.0751 | -0.0856 | 0.0535 | 0.2714 | 0.1108 | -0.1262 | -0.0964 | 0.1987 | 0.0378 | 0.0050 | 0.0262 | 0.0474 | 0.0908 | -0.0093 | -0.0751 | -0.0395 | 0.0258 | -0.0312 | 0.0907 | -0.1444 | 0.0216 | 0.0448 | -0.0079 | 0.3267 |
| Dimension 27 | 0.0074 | -0.0432 | -0.0030 | 0.0174 | 0.0102 | -0.0365 | -0.1380 | 0.2001 | -0.1075 | 0.0605 | -0.0550 | -0.1465 | -0.1081 | 0.0389 | 0.0473 | -0.1995 | -0.0001 | 0.2128 | -0.0266 | 0.0417 | -0.0028 | -0.1324 | 0.0316 | 0.0636 | 0.0651 | 0.0560 | -0.0832 | 0.0373 | -0.1159 | 0.3172 | 0.0340 | 0.0827 | 0.0608 | -0.0910 | 0.5142 | 0.2545 | -0.3735 | -0.1572 | -0.0606 | -0.0145 | -0.1497 | -0.0006 | 0.0972 | 0.0271 | -0.0342 | 0.0564 | 0.0259 | -0.0293 | -0.2008 | 0.0689 | 0.0623 | 0.0355 | 0.0487 | -0.1398 | -0.1206 |
| Dimension 28 | 0.0070 | 0.0156 | -0.0224 | -0.0856 | 0.0583 | -0.0929 | -0.1230 | 0.3137 | -0.0288 | 0.1099 | -0.0302 | 0.1367 | -0.1347 | -0.0595 | -0.0524 | -0.0514 | 0.1107 | 0.0769 | 0.1395 | 0.0701 | 0.1267 | -0.2394 | 0.0425 | 0.0212 | -0.0966 | 0.1806 | 0.0331 | -0.0205 | -0.0112 | 0.1355 | 0.1006 | 0.0988 | 0.1045 | -0.0412 | -0.2757 | 0.0639 | 0.5154 | -0.0376 | 0.0487 | 0.0281 | -0.0167 | 0.0236 | 0.0487 | -0.1431 | 0.0406 | 0.2203 | 0.0441 | -0.0725 | -0.2104 | 0.0745 | 0.1832 | 0.0720 | 0.0762 | -0.2464 | 0.0882 |
| Dimension 29 | 0.0066 | 0.0872 | -0.0576 | -0.1133 | -0.0356 | 0.2669 | -0.1574 | -0.0655 | 0.0779 | -0.0716 | 0.1146 | 0.0011 | -0.0639 | 0.0296 | 0.0154 | 0.1020 | 0.0097 | -0.1180 | -0.0146 | 0.0409 | -0.0971 | 0.2258 | 0.0023 | -0.0422 | 0.2029 | -0.2260 | -0.0213 | 0.3560 | 0.1551 | 0.1576 | -0.0247 | 0.1683 | -0.0073 | 0.1384 | 0.0379 | 0.1799 | 0.1398 | -0.0214 | -0.0577 | 0.0786 | 0.1873 | -0.0320 | 0.0506 | -0.0777 | -0.0031 | 0.1844 | 0.0549 | -0.0552 | -0.2977 | 0.0934 | 0.2058 | -0.0277 | 0.0868 | 0.3736 | -0.0674 |
| Dimension 30 | 0.0059 | 0.0597 | -0.1046 | -0.0148 | -0.0860 | -0.0959 | -0.1864 | 0.0279 | 0.1747 | -0.0417 | 0.0615 | 0.0399 | -0.1585 | 0.2114 | 0.0544 | -0.3988 | 0.1818 | 0.0746 | 0.1197 | 0.1754 | -0.0319 | 0.5452 | -0.0647 | -0.0716 | -0.0663 | 0.4172 | -0.1218 | 0.0360 | -0.0708 | -0.0394 | 0.0282 | 0.0411 | -0.0035 | -0.0011 | 0.0266 | -0.0533 | 0.0358 | 0.0312 | 0.0914 | 0.0003 | 0.0276 | -0.0752 | -0.0269 | 0.0199 | 0.0141 | -0.0327 | -0.0136 | 0.0082 | 0.1019 | 0.0450 | -0.1339 | 0.0052 | 0.0107 | 0.0782 | 0.0949 |
| Dimension 31 | 0.0058 | -0.0667 | -0.0832 | -0.0029 | -0.0240 | -0.0739 | -0.0559 | -0.0857 | 0.0477 | -0.1671 | 0.0657 | 0.0936 | 0.0600 | 0.0059 | -0.0118 | -0.0136 | -0.0549 | 0.1093 | -0.4161 | -0.0083 | -0.1852 | -0.0843 | 0.5210 | 0.0243 | -0.2189 | 0.0400 | -0.0509 | -0.0233 | 0.0146 | 0.0550 | -0.0316 | 0.1060 | 0.0098 | -0.0598 | 0.0879 | 0.1492 | 0.3035 | -0.0464 | -0.0105 | 0.0024 | 0.0127 | 0.0773 | 0.0972 | 0.0562 | 0.0159 | -0.0328 | -0.0263 | -0.0093 | 0.1568 | 0.2424 | -0.2691 | -0.0142 | -0.0674 | 0.1063 | -0.1859 |
| Dimension 32 | 0.0056 | 0.1468 | -0.0197 | 0.0452 | 0.1493 | -0.2450 | 0.0739 | -0.1011 | -0.1370 | -0.0377 | -0.1541 | 0.0143 | -0.1302 | 0.0947 | 0.0389 | 0.0269 | 0.0461 | -0.1586 | 0.2798 | 0.1374 | -0.0861 | 0.0026 | 0.0497 | 0.0239 | 0.4403 | -0.2554 | 0.0267 | 0.0652 | 0.0886 | -0.0133 | -0.0152 | 0.0766 | -0.0013 | -0.1149 | 0.1171 | 0.1122 | 0.1763 | -0.0227 | 0.0907 | -0.0587 | 0.0999 | -0.2332 | 0.0586 | 0.0849 | 0.0043 | -0.0461 | -0.0281 | 0.0055 | 0.1445 | 0.2420 | -0.2782 | -0.0329 | -0.0245 | -0.2634 | 0.0631 |
| Dimension 33 | 0.0053 | 0.1259 | 0.0020 | -0.0128 | 0.0345 | 0.0694 | 0.1570 | -0.1202 | 0.0028 | -0.0526 | -0.0836 | 0.0029 | 0.0096 | 0.5478 | -0.0467 | -0.1397 | -0.0095 | -0.0831 | 0.2577 | -0.0658 | -0.1962 | -0.3112 | 0.2348 | 0.2374 | 0.0060 | 0.2453 | -0.0199 | -0.1301 | 0.1381 | -0.0129 | -0.0155 | -0.1691 | -0.0330 | 0.1478 | -0.0654 | 0.0490 | -0.0483 | 0.0031 | 0.0751 | 0.0240 | 0.0545 | -0.0934 | -0.0297 | 0.0089 | -0.0324 | -0.0485 | -0.0090 | 0.0299 | -0.1481 | -0.0731 | 0.1562 | 0.0146 | 0.0070 | 0.1625 | -0.1535 |
| Dimension 34 | 0.0051 | 0.0540 | -0.0911 | -0.0015 | 0.0689 | 0.1057 | 0.2760 | -0.1692 | -0.0498 | -0.0475 | -0.0330 | -0.0198 | 0.0507 | -0.0381 | 0.0886 | -0.0066 | -0.0350 | 0.0291 | 0.0429 | 0.1771 | 0.1416 | 0.0716 | 0.1457 | -0.0777 | -0.2167 | -0.0196 | -0.0218 | -0.0885 | 0.0781 | 0.0027 | 0.0250 | 0.4515 | 0.1420 | -0.4172 | -0.1063 | -0.1057 | -0.1657 | -0.0482 | 0.2283 | -0.0127 | 0.0269 | -0.2897 | 0.1777 | 0.0743 | 0.0492 | 0.1127 | 0.0367 | -0.0534 | -0.0588 | -0.1536 | 0.1500 | 0.0168 | -0.0111 | 0.0436 | -0.0652 |
| Dimension 35 | 0.0049 | 0.0249 | -0.1455 | 0.0236 | -0.0014 | 0.0277 | -0.0449 | 0.0875 | -0.0094 | -0.0525 | -0.0350 | -0.0257 | 0.0470 | -0.0787 | 0.2795 | -0.1720 | 0.0099 | 0.0262 | 0.1347 | 0.3072 | 0.4620 | -0.0248 | 0.2586 | -0.0446 | -0.2712 | -0.3384 | -0.0262 | 0.0400 | 0.0491 | -0.0104 | -0.0251 | -0.3488 | -0.0809 | 0.2530 | 0.0383 | 0.0135 | -0.0111 | 0.0116 | 0.0808 | 0.0189 | 0.0322 | -0.1452 | -0.1002 | -0.0253 | -0.0159 | -0.0914 | -0.0181 | 0.0529 | -0.0507 | 0.0747 | -0.0030 | -0.0176 | -0.0032 | 0.0367 | -0.0358 |
| Dimension 36 | 0.0049 | 0.0839 | -0.0754 | -0.0258 | 0.0431 | -0.0650 | 0.0364 | 0.0209 | -0.0410 | -0.0820 | -0.0365 | -0.0006 | -0.0453 | -0.2976 | -0.1225 | 0.0456 | 0.1801 | 0.0713 | -0.2387 | 0.0692 | -0.2627 | 0.0474 | 0.4259 | -0.1632 | 0.2694 | 0.1602 | 0.0115 | 0.1254 | 0.0062 | -0.0279 | -0.0025 | -0.1814 | -0.0542 | 0.1335 | -0.0557 | -0.1685 | -0.1968 | 0.0183 | 0.2352 | -0.0381 | 0.0074 | -0.1351 | -0.0998 | -0.0507 | 0.0041 | -0.0158 | 0.0026 | 0.0399 | -0.1760 | -0.1216 | 0.1900 | -0.0257 | 0.0073 | -0.1768 | 0.2009 |
pca_results(df, pca)['Explained Variance'].sum()
#Confirmed
0.9503
Discussion 2.2: Perform Dimensionality Reduction¶
(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding dimensionality reduction. How many principal components / transformed features are you retaining for the next step of the analysis?)
• PCA reduces the dimensionality of the data while preserving as much of its variability as possible; in this evaluation we aim to retain enough components to explain 95% of the data's variance.
• Using a for-loop over candidate component counts and a 95% cumulative-variance threshold, we found that 36 components were sufficient.
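The cumulative-variance search described above can be sketched as follows. This is a minimal sketch on toy data; the real notebook applies the same idea to the cleaned, scaled demographics dataframe, and the random matrix here is only a stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data standing in for the scaled demographics matrix (an assumption;
# correlated columns so that fewer than 20 components capture most variance).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))

# Fit a full PCA and accumulate the per-component explained-variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components, cumulative[n_components - 1])
```

Fitting `PCA()` once and scanning the cumulative ratios is cheaper than re-fitting PCA inside the loop for every candidate component count.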
Step 2.3: Interpret Principal Components¶
Now that we have our transformed principal components, it's a good idea to examine the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable should be associated with decreases in the other.
- To investigate the features, you should map each weight to their corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.
- You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights, for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.
def weights(n: int, pca, df: pd.DataFrame) -> pd.DataFrame:
    """
    INPUT:
    n (int): index of the principal component to analyze
    pca (sklearn.decomposition.PCA): a fitted PCA object containing the eigenvectors
    df (pd.DataFrame): the dataframe whose columns the weights correspond to
    OUTPUT:
    df (pd.DataFrame): feature names as the index, with their corresponding
                       weights sorted in descending order
    """
    col_list = list(df.columns)
    weight = list(pca.components_[n])
    df = pd.DataFrame(list(zip(col_list, weight)), columns=['Dimension', 'Weights'])
    df = df.sort_values(by='Weights', ascending=False)
    df.set_index('Dimension', inplace=True)
    return df
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
first_comp = weights(0, pca, df)
plt.figure(figsize=(len(first_comp) * 0.3, 6))  # scale width with the number of features
plt.bar(first_comp.index, first_comp['Weights'])
plt.title('Weights of the First Principal Component')
plt.xlabel('Dimension')
plt.ylabel('Weight')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
second_comp = weights(1, pca, df)
plt.figure(figsize=(len(second_comp) * 0.3, 6))  # scale width with the number of features
plt.bar(second_comp.index, second_comp['Weights'])
plt.title('Weights of the Second Principal Component')
plt.xlabel('Dimension')
plt.ylabel('Weight')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
third_component = weights(2, pca, df)
plt.figure(figsize=(len(third_component) * 0.3, 6))  # scale width with the number of features
plt.bar(third_component.index, third_component['Weights'])
plt.title('Weights of the Third Principal Component')
plt.xlabel('Dimension')
plt.ylabel('Weight')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
final_comp = weights(35, pca, df)  # components are 0-indexed, so the 36th (final) component has index 35
plt.figure(figsize=(len(final_comp) * 0.3, 6))  # scale width with the number of features
plt.bar(final_comp.index, final_comp['Weights'])
plt.title('Weights of the Final (36th) Principal Component')
plt.xlabel('Dimension')
plt.ylabel('Weight')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
final_comp
| Weights | |
|---|---|
| Dimension | |
| SEMIO_DOM | 0.425884 |
| SEMIO_PFLICHT | 0.269351 |
| KKK | 0.235192 |
| movement | 0.200948 |
| RELAT_AB | 0.190000 |
| SEMIO_VERT | 0.180109 |
| SEMIO_TRADV | 0.160221 |
| KBA05_GBZ | 0.133485 |
| HH_EINKOMMEN_SCORE | 0.125398 |
| ALTERSKATEGORIE_GROB | 0.083941 |
| SEMIO_LUST | 0.071323 |
| SEMIO_KULT | 0.069172 |
| SEMIO_KRIT | 0.047356 |
| SEMIO_MAT | 0.045609 |
| FINANZ_SPARER | 0.043139 |
| PLZ8_GBZ | 0.039872 |
| FINANZ_ANLEGER | 0.036410 |
| FINANZ_UNAUFFAELLIGER | 0.020877 |
| GEBAEUDETYP_RASTER | 0.018348 |
| VERS_TYP | 0.011514 |
| ONLINE_AFFINITAET | 0.007423 |
| life_stage | 0.007267 |
| W_KEIT_KIND_HH | 0.006190 |
| PLZ8_ANTG3 | 0.004076 |
| PLZ8_HHZ | 0.002626 |
| RETOURTYP_BK_S | -0.000599 |
| CAMEO_INTL_2015 | -0.002522 |
| PLZ8_ANTG4 | -0.015840 |
| wealth | -0.025718 |
| FINANZ_MINIMALIST | -0.025776 |
| KONSUMNAEHE | -0.027935 |
| PRAEGENDE_JUGENDJAHRE | -0.036516 |
| MOBI_REGIO | -0.038148 |
| FINANZ_HAUSBAUER | -0.040972 |
| SEMIO_SOZ | -0.045268 |
| PLZ8_ANTG2 | -0.050656 |
| KBA05_ANTG2 | -0.054157 |
| BALLRAUM | -0.055670 |
| FINANZ_VORSORGER | -0.065035 |
| ANREDE_KZ | -0.075384 |
| HEALTH_TYP | -0.082011 |
| PLZ8_ANTG1 | -0.099785 |
| ORTSGR_KLS9 | -0.121589 |
| SEMIO_REL | -0.122525 |
| REGIOTYP | -0.135097 |
| SEMIO_KAEM | -0.163151 |
| EWDICHTE | -0.168451 |
| ARBEIT | -0.175954 |
| generation | -0.176797 |
| KBA05_ANTG1 | -0.181449 |
| INNENSTADT | -0.196824 |
| SEMIO_ERL | -0.238687 |
| SEMIO_RAT | -0.262656 |
| SEMIO_FAM | -0.297620 |
Discussion 2.3: Interpret Principal Components¶
• First component weights: FINANZ_MINIMALIST has the highest weight, denoting low-financial-interest investors.
• Second component weights: PRAEGENDE_JUGENDJAHRE has the highest value, prioritising movement across generations.
• Third component weights: SEMIO_KAEM tells us how belligerent the customer is.
• Final component weights: the dominant-minded (SEMIO_DOM), the dutiful (SEMIO_PFLICHT), and people with very high purchasing power (KKK) carry the largest weights.
The principal components are orthogonal by construction, so these weight patterns represent uncorrelated directions of variation in the data.
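The orthogonality of the components can be verified directly: the matrix of pairwise dot products of the component vectors should be the identity. A minimal sketch on toy data (the random matrix is a stand-in for the real feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix; any data works, since orthogonality of the
# components is a property of PCA itself, not of the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

pca = PCA(n_components=5).fit(X)

# Each row of components_ is one principal component (a unit vector),
# so components_ @ components_.T should be the 5x5 identity matrix.
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(5)))
```

This is why the weights of different components need not (and generally do not) correlate with one another.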
Step 3: Clustering¶
Step 3.1: Apply Clustering to General Population¶
You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
- Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.
- Then, compute the average difference from each point to its assigned cluster's center. Hint: The KMeans object's .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
- Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn't need to test for a number of clusters above about 30.
- Once you've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you'll be using them in the final Step 3.3.
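The .score()-based approach hinted at above can be sketched on toy data. Note that KMeans.score returns the negative sum of squared distances to the nearest centroid, so negating it and dividing by the number of points gives an average squared distance per point (the random matrix is a stand-in for the PCA-transformed data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the PCA-transformed population data (an assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))

avg_sq_dist = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # score() is the negative inertia; negate and average per point.
    avg_sq_dist.append(-km.score(X) / len(X))

# The average distance shrinks as k grows; the "elbow" is where the
# marginal improvement starts to flatten out.
print(avg_sq_dist[0] > avg_sq_dist[-1])
```

Plotting `avg_sq_dist` against `k` gives the elbow curve used below to choose the final cluster count.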
pca, input_pca = apply_pca(36, df)
input_pca
array([[-1.4945402 , 1.14600198, 1.7340266 , ..., 0.08530578,
0.12782479, -0.04777715],
[-4.43718914, 2.02626803, 1.95999335, ..., -0.16507349,
-0.48791817, 0.48001781],
[ 0.6756197 , 0.51984735, 3.04428583, ..., -0.03045957,
0.37642888, 0.42620528],
...,
[ 0.86775767, 3.89047834, 1.84754791, ..., 0.59030547,
0.17706409, 0.1290213 ],
[-6.325277 , 1.46714174, -3.27781371, ..., -0.41757441,
0.03695333, 0.06304165],
[ 2.02143688, -1.7472236 , -3.38823799, ..., 1.08966956,
0.27419976, 0.07838945]])
Elbow_M = KElbowVisualizer(KMeans(), k=36)  # search cluster counts up to k=36
# Fit the visualizer to the PCA-transformed data
Elbow_M.fit(input_pca)
# Display the Elbow Method plot
Elbow_M.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
cluster_range = range(1, 30)  # Try 1 to 29 clusters
avg_distances = []
for k in cluster_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(input_pca)
distances = np.min(kmeans.transform(input_pca), axis=1)
avg_distance = np.mean(distances)
avg_distances.append(avg_distance)
# Print results
for k, dist in zip(cluster_range, avg_distances):
print(f"Clusters: {k}, Average distance: {dist:.2f}")
Clusters: 1, Average distance: 7.09 Clusters: 2, Average distance: 6.57 Clusters: 3, Average distance: 6.22 Clusters: 4, Average distance: 6.02 Clusters: 5, Average distance: 5.84 Clusters: 6, Average distance: 5.59 Clusters: 7, Average distance: 5.46 Clusters: 8, Average distance: 5.34 Clusters: 9, Average distance: 5.26 Clusters: 10, Average distance: 5.22 Clusters: 11, Average distance: 5.16 Clusters: 12, Average distance: 5.13 Clusters: 13, Average distance: 5.08 Clusters: 14, Average distance: 5.05 Clusters: 15, Average distance: 5.02 Clusters: 16, Average distance: 5.00 Clusters: 17, Average distance: 4.97 Clusters: 18, Average distance: 4.95 Clusters: 19, Average distance: 4.94 Clusters: 20, Average distance: 4.90 Clusters: 21, Average distance: 4.87 Clusters: 22, Average distance: 4.85 Clusters: 23, Average distance: 4.83 Clusters: 24, Average distance: 4.82 Clusters: 25, Average distance: 4.81 Clusters: 26, Average distance: 4.79 Clusters: 27, Average distance: 4.78 Clusters: 28, Average distance: 4.76 Clusters: 29, Average distance: 4.75
cluster_range = range(1, 40)  # Extend the search to 1 to 39 clusters
avg_distances = []
for k in cluster_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(input_pca)
distances = np.min(kmeans.transform(input_pca), axis=1)
avg_distance = np.mean(distances)
avg_distances.append(avg_distance)
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, avg_distances, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Average Within-Cluster Distance')
plt.title('K-Means Clustering: Average Within-Cluster Distance')
plt.grid(True)
plt.show()
# Visualize the first two PCA dimensions, colored by the last fitted model's
# cluster assignments, with the cluster centers overlaid.
labels = kmeans.predict(input_pca)
plt.scatter(input_pca[:, 0], input_pca[:, 1], c=labels, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
K-Means Prediction on Population Data¶
kmeans = KMeans(n_clusters=11, random_state=42)  # k=11, as suggested by yellowbrick's distortion-score elbow and our own elbow plot
model = kmeans.fit(input_pca)
pop_model_pred = model.predict(input_pca)
pop_model_pred
array([ 0, 7, 10, ..., 4, 3, 9], dtype=int32)
Discussion 3.1: Apply Clustering to General Population¶
- Our elbow plot suggests roughly 11 clusters.
- After researching, we used yellowbrick's KElbowVisualizer, as described in a Medium post.
In all, we have gone with yellowbrick's elbow value because:
- The automatic detection of the elbow point is a valuable feature that would require additional coding if done manually.
- The inclusion of multiple metrics provides a more robust analysis.
Step 3.2: Apply All Steps to the Customer Data¶
Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.
- Don't forget when loading in the customers data that it is semicolon (`;`) delimited.
- Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the `clean_data()` function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)
- Use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, you should not be using a `.fit()` or `.fit_transform()` method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
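The transform-only rule above can be sketched end to end on toy data (the array names here are stand-ins, not the project's variables): every object is fit once on the population and only its `transform`/`predict` methods touch the customers.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
population = rng.normal(size=(200, 10))    # stand-in for the cleaned population data
customers_arr = rng.normal(size=(50, 10))  # stand-in for the cleaned customer data

# Fit once, on the population only.
imp = SimpleImputer(strategy='median').fit(population)
scaler = StandardScaler().fit(imp.transform(population))
pca = PCA(n_components=5).fit(scaler.transform(imp.transform(population)))
kmeans = KMeans(n_clusters=11, n_init=10, random_state=42).fit(
    pca.transform(scaler.transform(imp.transform(population))))

# Transform-only on the customers: no .fit() / .fit_transform() here.
cust_clusters = kmeans.predict(
    pca.transform(scaler.transform(imp.transform(customers_arr))))
```

Re-fitting any of these objects on the customers would put the two datasets in different feature spaces and make the cluster comparison meaningless.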
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv', sep=';')
customers.head()
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GEBURTSJAHR | GFK_URLAUBERTYP | GREEN_AVANTGARDE | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | NATIONALITAET_KZ | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | SOHO_KZ | TITEL_KZ | VERS_TYP | ZABEOTYP | ALTER_HH | ANZ_PERSONEN | ANZ_TITEL | HH_EINKOMMEN_SCORE | KK_KUNDENTYP | W_KEIT_KIND_HH | WOHNDAUER_2008 | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE | CAMEO_DEUG_2015 | CAMEO_DEU_2015 | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_ANTG4 | KBA05_BAUMAX | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | KBA13_ANZAHL_PKW | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 1 | 5.0 | 5 | 1 | 5 | 1 | 2 | 2 | 2 | 0 | 4.0 | 1 | 1 | 20.0 | 5.0 | 2.0 | 2.0 | 10.0 | 5.0 | 1 | 4 | 5.0 | 6 | 5 | 2 | 6 | 6 | 7 | 3 | 4 | 1 | 3 | 1 | 1 | 2 | 1 | 3 | 0.0 | 0.0 | 1 | 3 | 10.0 | 2.0 | 0.0 | 1.0 | NaN | 6.0 | 9.0 | 1.0 | 0.0 | 1.0 | 5.0 | 1992.0 | W | 7.0 | 1 | 1A | 13 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | 2.0 | 4.0 | 4.0 | 1.0 | 4.0 | 3.0 | 1.0 | 1201.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 |
| 1 | -1 | 4 | 1 | NaN | 5 | 1 | 5 | 1 | 3 | 2 | 2 | 0 | NaN | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 3 | 6 | 2 | 6 | 7 | 5 | 3 | 4 | 1 | 3 | 3 | 2 | 4 | 1 | 3 | 0.0 | 0.0 | 1 | 3 | 11.0 | 3.0 | 0.0 | NaN | NaN | 0.0 | 9.0 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | -1 | 4 | 2 | 2.0 | 5 | 1 | 5 | 1 | 4 | 4 | 2 | 0 | 3.0 | 1 | 2 | 13.0 | 3.0 | 1.0 | 1.0 | 10.0 | 5.0 | 1 | 4 | 5.0 | 2 | 2 | 1 | 3 | 3 | 7 | 7 | 1 | 2 | 7 | 5 | 6 | 4 | 1 | 1 | 0.0 | 0.0 | 2 | 3 | 6.0 | 1.0 | 0.0 | 1.0 | NaN | 6.0 | 9.0 | 1.0 | 0.0 | 8.0 | 1.0 | 1992.0 | W | 2.0 | 5 | 5D | 34 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 7.0 | 4.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 7.0 | 433.0 | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 |
| 3 | 1 | 4 | 1 | 2.0 | 5 | 1 | 5 | 2 | 1 | 2 | 6 | 0 | 10.0 | 0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 4.0 | 1 | 1 | 3.0 | 6 | 5 | 3 | 4 | 7 | 5 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 0 | 0.0 | 0.0 | 1 | 1 | 8.0 | 0.0 | 0.0 | 4.0 | NaN | NaN | 9.0 | 0.0 | NaN | 2.0 | 2.0 | 1992.0 | W | 7.0 | 4 | 4C | 24 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 7.0 | 1.0 | 7.0 | 4.0 | 3.0 | 4.0 | 2.0 | 6.0 | 755.0 | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 |
| 4 | -1 | 3 | 1 | 6.0 | 3 | 1 | 4 | 4 | 5 | 2 | 2 | 1960 | 2.0 | 0 | 3 | 31.0 | 10.0 | 10.0 | 5.0 | 1.0 | 1.0 | 1 | 8 | 5.0 | 4 | 5 | 4 | 6 | 5 | 6 | 4 | 5 | 5 | 3 | 5 | 2 | 5 | 4 | 1 | 0.0 | 0.0 | 2 | 1 | 20.0 | 4.0 | 0.0 | 6.0 | 2.0 | 2.0 | 9.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1992.0 | W | 3.0 | 7 | 7B | 41 | 0.0 | 3.0 | 2.0 | 0.0 | 0.0 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 4.0 | 3.0 | 5.0 | 7.0 | 513.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 |
customers.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 191652 entries, 0 to 191651 Data columns (total 85 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AGER_TYP 191652 non-null int64 1 ALTERSKATEGORIE_GROB 191652 non-null int64 2 ANREDE_KZ 191652 non-null int64 3 CJT_GESAMTTYP 188439 non-null float64 4 FINANZ_MINIMALIST 191652 non-null int64 5 FINANZ_SPARER 191652 non-null int64 6 FINANZ_VORSORGER 191652 non-null int64 7 FINANZ_ANLEGER 191652 non-null int64 8 FINANZ_UNAUFFAELLIGER 191652 non-null int64 9 FINANZ_HAUSBAUER 191652 non-null int64 10 FINANZTYP 191652 non-null int64 11 GEBURTSJAHR 191652 non-null int64 12 GFK_URLAUBERTYP 188439 non-null float64 13 GREEN_AVANTGARDE 191652 non-null int64 14 HEALTH_TYP 191652 non-null int64 15 LP_LEBENSPHASE_FEIN 188439 non-null float64 16 LP_LEBENSPHASE_GROB 188439 non-null float64 17 LP_FAMILIE_FEIN 188439 non-null float64 18 LP_FAMILIE_GROB 188439 non-null float64 19 LP_STATUS_FEIN 188439 non-null float64 20 LP_STATUS_GROB 188439 non-null float64 21 NATIONALITAET_KZ 191652 non-null int64 22 PRAEGENDE_JUGENDJAHRE 191652 non-null int64 23 RETOURTYP_BK_S 188439 non-null float64 24 SEMIO_SOZ 191652 non-null int64 25 SEMIO_FAM 191652 non-null int64 26 SEMIO_REL 191652 non-null int64 27 SEMIO_MAT 191652 non-null int64 28 SEMIO_VERT 191652 non-null int64 29 SEMIO_LUST 191652 non-null int64 30 SEMIO_ERL 191652 non-null int64 31 SEMIO_KULT 191652 non-null int64 32 SEMIO_RAT 191652 non-null int64 33 SEMIO_KRIT 191652 non-null int64 34 SEMIO_DOM 191652 non-null int64 35 SEMIO_KAEM 191652 non-null int64 36 SEMIO_PFLICHT 191652 non-null int64 37 SEMIO_TRADV 191652 non-null int64 38 SHOPPER_TYP 191652 non-null int64 39 SOHO_KZ 145056 non-null float64 40 TITEL_KZ 145056 non-null float64 41 VERS_TYP 191652 non-null int64 42 ZABEOTYP 191652 non-null int64 43 ALTER_HH 145056 non-null float64 44 ANZ_PERSONEN 145056 non-null float64 45 ANZ_TITEL 145056 non-null float64 46 HH_EINKOMMEN_SCORE 188684 non-null float64 47 
KK_KUNDENTYP 79715 non-null float64 48 W_KEIT_KIND_HH 137910 non-null float64 49 WOHNDAUER_2008 145056 non-null float64 50 ANZ_HAUSHALTE_AKTIV 141725 non-null float64 51 ANZ_HH_TITEL 139542 non-null float64 52 GEBAEUDETYP 141725 non-null float64 53 KONSUMNAEHE 145001 non-null float64 54 MIN_GEBAEUDEJAHR 141725 non-null float64 55 OST_WEST_KZ 141725 non-null object 56 WOHNLAGE 141725 non-null float64 57 CAMEO_DEUG_2015 141224 non-null object 58 CAMEO_DEU_2015 141224 non-null object 59 CAMEO_INTL_2015 141224 non-null object 60 KBA05_ANTG1 135672 non-null float64 61 KBA05_ANTG2 135672 non-null float64 62 KBA05_ANTG3 135672 non-null float64 63 KBA05_ANTG4 135672 non-null float64 64 KBA05_BAUMAX 135672 non-null float64 65 KBA05_GBZ 135672 non-null float64 66 BALLRAUM 141693 non-null float64 67 EWDICHTE 141693 non-null float64 68 INNENSTADT 141693 non-null float64 69 GEBAEUDETYP_RASTER 141725 non-null float64 70 KKK 137392 non-null float64 71 MOBI_REGIO 135672 non-null float64 72 ONLINE_AFFINITAET 188439 non-null float64 73 REGIOTYP 137392 non-null float64 74 KBA13_ANZAHL_PKW 140371 non-null float64 75 PLZ8_ANTG1 138888 non-null float64 76 PLZ8_ANTG2 138888 non-null float64 77 PLZ8_ANTG3 138888 non-null float64 78 PLZ8_ANTG4 138888 non-null float64 79 PLZ8_BAUMAX 138888 non-null float64 80 PLZ8_HHZ 138888 non-null float64 81 PLZ8_GBZ 138888 non-null float64 82 ARBEIT 141176 non-null float64 83 ORTSGR_KLS9 141176 non-null float64 84 RELAT_AB 141176 non-null float64 dtypes: float64(49), int64(32), object(4) memory usage: 124.3+ MB
customers.isna().sum()
AGER_TYP 0 ALTERSKATEGORIE_GROB 0 ANREDE_KZ 0 CJT_GESAMTTYP 3213 FINANZ_MINIMALIST 0 FINANZ_SPARER 0 FINANZ_VORSORGER 0 FINANZ_ANLEGER 0 FINANZ_UNAUFFAELLIGER 0 FINANZ_HAUSBAUER 0 FINANZTYP 0 GEBURTSJAHR 0 GFK_URLAUBERTYP 3213 GREEN_AVANTGARDE 0 HEALTH_TYP 0 LP_LEBENSPHASE_FEIN 3213 LP_LEBENSPHASE_GROB 3213 LP_FAMILIE_FEIN 3213 LP_FAMILIE_GROB 3213 LP_STATUS_FEIN 3213 LP_STATUS_GROB 3213 NATIONALITAET_KZ 0 PRAEGENDE_JUGENDJAHRE 0 RETOURTYP_BK_S 3213 SEMIO_SOZ 0 SEMIO_FAM 0 SEMIO_REL 0 SEMIO_MAT 0 SEMIO_VERT 0 SEMIO_LUST 0 SEMIO_ERL 0 SEMIO_KULT 0 SEMIO_RAT 0 SEMIO_KRIT 0 SEMIO_DOM 0 SEMIO_KAEM 0 SEMIO_PFLICHT 0 SEMIO_TRADV 0 SHOPPER_TYP 0 SOHO_KZ 46596 TITEL_KZ 46596 VERS_TYP 0 ZABEOTYP 0 ALTER_HH 46596 ANZ_PERSONEN 46596 ANZ_TITEL 46596 HH_EINKOMMEN_SCORE 2968 KK_KUNDENTYP 111937 W_KEIT_KIND_HH 53742 WOHNDAUER_2008 46596 ANZ_HAUSHALTE_AKTIV 49927 ANZ_HH_TITEL 52110 GEBAEUDETYP 49927 KONSUMNAEHE 46651 MIN_GEBAEUDEJAHR 49927 OST_WEST_KZ 49927 WOHNLAGE 49927 CAMEO_DEUG_2015 50428 CAMEO_DEU_2015 50428 CAMEO_INTL_2015 50428 KBA05_ANTG1 55980 KBA05_ANTG2 55980 KBA05_ANTG3 55980 KBA05_ANTG4 55980 KBA05_BAUMAX 55980 KBA05_GBZ 55980 BALLRAUM 49959 EWDICHTE 49959 INNENSTADT 49959 GEBAEUDETYP_RASTER 49927 KKK 54260 MOBI_REGIO 55980 ONLINE_AFFINITAET 3213 REGIOTYP 54260 KBA13_ANZAHL_PKW 51281 PLZ8_ANTG1 52764 PLZ8_ANTG2 52764 PLZ8_ANTG3 52764 PLZ8_ANTG4 52764 PLZ8_BAUMAX 52764 PLZ8_HHZ 52764 PLZ8_GBZ 52764 ARBEIT 50476 ORTSGR_KLS9 50476 RELAT_AB 50476 dtype: int64
Preprocessing the Customers Data¶
cust = clean_data(customers) #using cleaning function
cust.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | GREEN_AVANTGARDE | HEALTH_TYP | PRAEGENDE_JUGENDJAHRE | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | VERS_TYP | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | KONSUMNAEHE | CAMEO_INTL_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | wealth | life_stage | generation | movement | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 1 | 5 | 1 | 5 | 1 | 2 | 2 | 1 | 1.0 | 4.0 | 5.0 | 6 | 5 | 2 | 6 | 6 | 7 | 3 | 4 | 1 | 3 | 1 | 1 | 2 | 1 | 1.0 | 1.0 | 6.0 | 5.0 | 13.0 | 2.0 | 2.0 | 4.0 | 3.0 | 2.0 | 4.0 | 4.0 | 1.0 | 4.0 | 3.0 | 1.0 | 3.0 | 3.0 | 1.0 | 0.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 | 0 | 2 | 1 | 1 |
| 1 | 4.0 | 1 | 5 | 1 | 5 | 1 | 3 | 2 | 0 | 1.0 | NaN | NaN | 3 | 6 | 2 | 6 | 7 | 5 | 3 | 4 | 1 | 3 | 3 | 2 | 4 | 1 | 1.0 | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 4 | 5 | 1 |
| 2 | 4.0 | 2 | 5 | 1 | 5 | 1 | 4 | 4 | 1 | 2.0 | 4.0 | 5.0 | 2 | 2 | 1 | 3 | 3 | 7 | 7 | 1 | 2 | 7 | 5 | 6 | 4 | 1 | 2.0 | 1.0 | 6.0 | 1.0 | 34.0 | 2.0 | 2.0 | 3.0 | 7.0 | 4.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 7.0 | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 | 2 | 3 | 1 | 1 |
| 3 | 4.0 | 1 | 5 | 1 | 5 | 2 | 1 | 2 | 0 | 2.0 | 1.0 | 3.0 | 6 | 5 | 3 | 4 | 7 | 5 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 1.0 | 4.0 | NaN | 2.0 | 24.0 | 3.0 | 0.0 | 4.0 | 7.0 | 1.0 | 7.0 | 4.0 | 3.0 | 4.0 | 2.0 | 6.0 | 3.0 | 2.0 | 1.0 | 0.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 | 1 | 3 | 0 | 0 |
| 4 | 3.0 | 1 | 3 | 1 | 4 | 4 | 5 | 2 | 0 | 3.0 | 8.0 | 5.0 | 4 | 5 | 4 | 6 | 5 | 6 | 4 | 5 | 5 | 3 | 5 | 2 | 5 | 4 | 2.0 | 6.0 | 2.0 | 1.0 | 41.0 | 0.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 4.0 | 3.0 | 5.0 | 7.0 | 2.0 | 4.0 | 2.0 | 1.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 | 3 | 0 | 3 | 0 |
cust.shape
(191652, 55)
cust.isna().sum()
ALTERSKATEGORIE_GROB 240 ANREDE_KZ 0 FINANZ_MINIMALIST 0 FINANZ_SPARER 0 FINANZ_VORSORGER 0 FINANZ_ANLEGER 0 FINANZ_UNAUFFAELLIGER 0 FINANZ_HAUSBAUER 0 GREEN_AVANTGARDE 0 HEALTH_TYP 48990 PRAEGENDE_JUGENDJAHRE 48487 RETOURTYP_BK_S 3213 SEMIO_SOZ 0 SEMIO_FAM 0 SEMIO_REL 0 SEMIO_MAT 0 SEMIO_VERT 0 SEMIO_LUST 0 SEMIO_ERL 0 SEMIO_KULT 0 SEMIO_RAT 0 SEMIO_KRIT 0 SEMIO_DOM 0 SEMIO_KAEM 0 SEMIO_PFLICHT 0 SEMIO_TRADV 0 VERS_TYP 48990 HH_EINKOMMEN_SCORE 2968 W_KEIT_KIND_HH 56937 KONSUMNAEHE 46651 CAMEO_INTL_2015 50554 KBA05_ANTG1 55980 KBA05_ANTG2 55980 KBA05_GBZ 55982 BALLRAUM 49959 EWDICHTE 49959 INNENSTADT 49959 GEBAEUDETYP_RASTER 49927 KKK 60064 MOBI_REGIO 55980 ONLINE_AFFINITAET 3213 REGIOTYP 60064 PLZ8_ANTG1 52764 PLZ8_ANTG2 52764 PLZ8_ANTG3 52764 PLZ8_ANTG4 52764 PLZ8_HHZ 52764 PLZ8_GBZ 52764 ARBEIT 50499 ORTSGR_KLS9 50476 RELAT_AB 50499 wealth 0 life_stage 0 generation 0 movement 0 dtype: int64
def add_missing_dummy_columns(d, columns):
    """Add (as all-zero columns) any columns expected in `columns` but absent from `d`."""
    missing_cols = set(columns) - set(d.columns)
    print(missing_cols)
    for c in missing_cols:
        d[c] = 0
def fix_columns(d, columns):
    """Align `d`'s columns with `columns`: add missing ones as zeros, report extras, reorder."""
    add_missing_dummy_columns(d, columns)
    # make sure we have all the columns we need
    assert set(columns) - set(d.columns) == set()
    extra_cols = set(d.columns) - set(columns)
    if extra_cols:
        print("extra columns:", extra_cols)
    d = d[columns]
    return d
customers = fix_columns(cust, df.columns.tolist())
set()
extra columns: {'GREEN_AVANTGARDE'}
cust = cust.drop(['GREEN_AVANTGARDE'], axis=1)
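A quick sanity check of the column-alignment helpers on toy frames (the helper definitions are repeated here so the snippet runs standalone; the column names are made up):

```python
import pandas as pd

def add_missing_dummy_columns(d, columns):
    # add any expected-but-absent columns as all-zero
    for c in set(columns) - set(d.columns):
        d[c] = 0

def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)
    extra = set(d.columns) - set(columns)
    if extra:
        print("extra columns:", extra)
    return d[columns]  # also enforces the reference column order

ref_cols = ['a', 'b', 'c']
toy = pd.DataFrame({'b': [1, 2], 'c': [3, 4], 'x': [5, 6]})  # missing 'a', extra 'x'
aligned = fix_columns(toy, ref_cols)
# aligned has exactly the columns ['a', 'b', 'c'], with 'a' filled with zeros
```

This mirrors what happened above: `GREEN_AVANTGARDE` was reported as an extra column, so we dropped it before imputing.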
# imputing NaN values with the median imputer fit on the population data
cust_imp = pd.DataFrame(imp_median.transform(cust), columns=cust.columns)
#verifying nan count
cust_imp.isna().sum().sum()
0
# standardizing with the scaler fit on the population data
cust_scaled = pd.DataFrame(scaler.transform(cust_imp), columns=cust_imp.columns)
# reconfirming the fitted PCA's n_components
pca?
Type: PCA String form: PCA(n_components=36) File: ~/opt/anaconda3/lib/python3.11/site-packages/sklearn/decomposition/_pca.py
# applying the fitted PCA to the scaled customer data
cust_pca = pca.transform(cust_scaled)
/Users/chrismo/opt/anaconda3/lib/python3.11/site-packages/sklearn/base.py:486: UserWarning: X has feature names, but PCA was fitted without feature names warnings.warn(
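The warning above is harmless here, since the customer columns are in the same order as the population columns the PCA was fit on, but it can be avoided by keeping the input type consistent between fit and transform. A minimal sketch of the mismatch (toy data, nothing from the project):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 4))
pca = PCA(n_components=2).fit(X)    # fitted on a bare ndarray

df = pd.DataFrame(X, columns=list('abcd'))
out = pca.transform(df.to_numpy())  # pass the values, not the DataFrame,
                                    # so no feature-name check fires
```

Transforming `df` directly would trigger the same "X has feature names, but PCA was fitted without feature names" warning.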
Applying Clustering from the General Population Data onto the Customer Data¶
model?
Type: KMeans String form: KMeans(n_clusters=11, random_state=42) File: ~/opt/anaconda3/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py
cust_model_pred = model.predict(cust_pca)  # the same KMeans model fit on the population data
cust_model_pred
array([8, 9, 6, ..., 8, 6, 1], dtype=int32)
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
Step 3.3: Compare Customer Data to Demographics Data¶
At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base to be universal, then the cluster assignment proportions should be fairly similar between the two. If there are only particular segments of the population that are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid) then that suggests the people in that cluster to be a target audience for the company. On the other hand, the proportion of the data in a cluster being larger in the general population than the customer data (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons to be outside of the target demographics.
Take a look at the following points in this step:
- Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy.
- Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!
- Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
- Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
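The .inverse_transform() route from the points above can be sketched end-to-end on synthetic data; the object names (scaler, pca, km) and the data shape are assumptions standing in for whatever your notebook actually uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in for the cleaned demographics data

# Fit the same scale -> PCA -> cluster pipeline used on the population data
scaler = StandardScaler().fit(X)
pca = PCA(n_components=3).fit(scaler.transform(X))
X_pca = pca.transform(scaler.transform(X))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_pca)

# Undo PCA, then undo scaling, to read a centroid in original feature units
centroid = km.cluster_centers_[0].reshape(1, -1)  # shape (1, n_components)
centroid_original = scaler.inverse_transform(pca.inverse_transform(centroid))
print(centroid_original.shape)  # (1, n_features)
```

The key detail is the order: the transforms were applied as scale-then-PCA, so they must be undone as PCA-then-scale.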
# Define the desired clusters
desired_clusters = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # cluster labels 0 through 11
# Wrap the cluster predictions in pandas Series for convenient counting
cust__pred = pd.Series(cust_model_pred)
pop__pred = pd.Series(pop_model_pred)
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
# Create the figure and subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Plot customer data
sns.countplot(x=cust_model_pred, palette='Set2', ax=ax1, hue=cust_model_pred, legend=False)
ax1.set_title('Customer Population Data')
ax1.set_xlabel('Cluster')
ax1.set_ylabel('Count')
# Plot general demographic data
sns.countplot(x=pop_model_pred, palette='Set2', ax=ax2, hue=pop_model_pred, legend=False)
ax2.set_title('General Population Data')
ax2.set_xlabel('Cluster')
ax2.set_ylabel('Count')
# Adjust layout and display the plot
plt.tight_layout()
plt.show();
# Create the figure
plt.figure(figsize=(10, 6))
# Plot customer data
sns.countplot(x=cust_model_pred, hue=cust_model_pred, palette='Set3', order=desired_clusters, legend=False, alpha=0.6)
# Plot general demographic data
sns.countplot(x=pop_model_pred, hue=pop_model_pred, palette='Set2', order=desired_clusters, legend=False, alpha=0.6)
# Add labels and title
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.title('Overlap of Customer Data and General Population Data')
# Display the plot
plt.tight_layout()
plt.show();
# Generating proportions for the customer and azdias clusters, using one
# sorted label order so both datasets line up cluster-for-cluster
# (zip(*Counter(...).items()) yields an arbitrary, dataset-specific order)
labels = sorted(set(cust_model_pred) | set(pop_model_pred))
v = [np.mean(np.asarray(cust_model_pred) == label) for label in labels]   # customer proportions
v1 = [np.mean(np.asarray(pop_model_pred) == label) for label in labels]   # population proportions
indexes = np.arange(len(labels))
# Configuring the graph
width = 0.35 # Width of each bar
spacing = 0.2 # Space between clusters
plt.figure(figsize=(12, 8))
# Calculate new x-coordinates for grouped bars
x = np.arange(0, len(labels) * (2*width + spacing), 2*width + spacing)
# Plot bars
plt.bar(x, v1, width, label='Population Data')
plt.bar(x + width, v, width, color='r', label='Customer Data')
# Adjust x-axis ticks and labels
plt.xticks(x + width/2, labels)
plt.legend(loc='best')
plt.xlabel('Cluster')
plt.ylabel('Proportion of data points')
plt.grid()
plt.show()
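Rather than eyeballing the bars, the same over/underrepresented clusters can be ranked programmatically. The proportion arrays below are hypothetical, in a shared cluster order (in this notebook, v and v1 play that role):

```python
import numpy as np

# Hypothetical per-cluster proportions, same cluster order in both arrays
cust_prop = np.array([0.15, 0.05, 0.30, 0.50])
pop_prop = np.array([0.05, 0.10, 0.35, 0.50])

# Positive difference: more common among customers than in the population
diff = cust_prop - pop_prop
over = np.argsort(diff)[::-1]   # most overrepresented clusters first
under = np.argsort(diff)        # most underrepresented clusters first
print(over[0], under[0])  # → 0 1
```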
Clusters 0, 8, and 9 are overrepresented in the customer data¶
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Exploring the 0th cluster now
cluster_0 = model.cluster_centers_[0]
pca_0 = pca.inverse_transform(cluster_0)
pca_0_2d = np.array(pca_0).reshape(1, -1) # Reshape to 2D
overrepresented = scaler.inverse_transform(pca_0_2d)
overrepresented = pd.Series(data=overrepresented.flatten(), index=list(df.keys()))
display(overrepresented.head(45))
ALTERSKATEGORIE_GROB      2.643979
ANREDE_KZ                 1.432840
FINANZ_MINIMALIST         3.791299
FINANZ_SPARER             3.165392
FINANZ_VORSORGER          3.888655
FINANZ_ANLEGER            4.629506
FINANZ_UNAUFFAELLIGER     4.793987
FINANZ_HAUSBAUER          2.779347
HEALTH_TYP                1.803794
PRAEGENDE_JUGENDJAHRE     6.630160
RETOURTYP_BK_S            3.621399
SEMIO_SOZ                 2.084913
SEMIO_FAM                 5.961617
SEMIO_REL                 6.840431
SEMIO_MAT                 4.830767
SEMIO_VERT                0.967002
SEMIO_LUST                5.912300
SEMIO_ERL                 3.181842
SEMIO_KULT                2.933584
SEMIO_RAT                 3.303255
SEMIO_KRIT                7.044412
SEMIO_DOM                 5.636096
SEMIO_KAEM                5.587260
SEMIO_PFLICHT             4.245575
SEMIO_TRADV               2.413348
VERS_TYP                  2.032865
HH_EINKOMMEN_SCORE        1.074767
W_KEIT_KIND_HH            4.056209
KONSUMNAEHE               3.055724
CAMEO_INTL_2015          37.486204
KBA05_ANTG1               1.445589
KBA05_ANTG2               1.012477
KBA05_GBZ                 3.341329
BALLRAUM                  5.114782
EWDICHTE                  3.981913
INNENSTADT                5.196747
GEBAEUDETYP_RASTER        4.060258
KKK                       2.776842
MOBI_REGIO                3.455386
ONLINE_AFFINITAET         2.390382
REGIOTYP                  4.700004
PLZ8_ANTG1                2.218213
PLZ8_ANTG2                2.952106
PLZ8_ANTG3                1.834660
PLZ8_ANTG4                0.865878
dtype: float64
On the underrepresented side, clusters 4, 7, and 3 are the ones to consider¶
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_4 = model.cluster_centers_[4]
pca_4 = pca.inverse_transform(cluster_4)
pca_4_2d = np.array(pca_4).reshape(1, -1) # Reshape to 2D
underrepresented = scaler.inverse_transform(pca_4_2d)
underrepresented = pd.Series(data=underrepresented.flatten(), index=list(df.keys()))
display(underrepresented.head(45))
ALTERSKATEGORIE_GROB      1.858449
ANREDE_KZ                 1.813097
FINANZ_MINIMALIST         3.428030
FINANZ_SPARER             2.995665
FINANZ_VORSORGER          3.317424
FINANZ_ANLEGER            3.797548
FINANZ_UNAUFFAELLIGER     3.782066
FINANZ_HAUSBAUER          2.576793
HEALTH_TYP                2.040646
PRAEGENDE_JUGENDJAHRE     9.487140
RETOURTYP_BK_S            2.984537
SEMIO_SOZ                 2.588261
SEMIO_FAM                 3.902026
SEMIO_REL                 5.146993
SEMIO_MAT                 5.045156
SEMIO_VERT                2.614274
SEMIO_LUST                4.432167
SEMIO_ERL                 4.083262
SEMIO_KULT                3.721611
SEMIO_RAT                 5.147265
SEMIO_KRIT                5.566548
SEMIO_DOM                 5.657643
SEMIO_KAEM                5.348109
SEMIO_PFLICHT             4.998034
SEMIO_TRADV               4.802555
VERS_TYP                  1.496673
HH_EINKOMMEN_SCORE        2.735408
W_KEIT_KIND_HH            3.275667
KONSUMNAEHE               3.702680
CAMEO_INTL_2015          24.496840
KBA05_ANTG1               2.608642
KBA05_ANTG2               1.171091
KBA05_GBZ                 3.994225
BALLRAUM                  5.074914
EWDICHTE                  3.054661
INNENSTADT                5.599150
GEBAEUDETYP_RASTER        4.095579
KKK                       2.506862
MOBI_REGIO                4.101765
ONLINE_AFFINITAET         3.515877
REGIOTYP                  4.029731
PLZ8_ANTG1                2.914575
PLZ8_ANTG2                2.458729
PLZ8_ANTG3                1.020368
PLZ8_ANTG4                0.294508
dtype: float64
Discussion 3.3: Compare Customer Data to Demographics Data¶
- From the plots we can conclude that the company's customer base is not universal: the cluster proportions for the customer data differ markedly from those of the general population, so only particular segments of the population are drawn to the company's products.
The mail-order company should target clusters 0, 8, and 9, which are overrepresented in the customer data¶
Looking at cluster 0 specifically, the centroid describes:
- People between ages 46-60
- Strong money savers
- Men
- People with low interest in financial investing
The underrepresented clusters (4, 7, and 3) tell a different story¶
Using the 4th cluster specifically, the most underrepresented one:
- People between ages 30-45
- Predominantly women
- People with average interest in financial matters
Congratulations on making it this far in the project! Before you finish, make sure to check through the entire notebook from top to bottom to make sure that your analysis follows a logical flow and all of your findings are documented in Discussion cells. Once you've checked over all of your work, you should export the notebook as an HTML document to submit for evaluation. You can do this from the menu, navigating to File -> Download as -> HTML (.html). You will submit both that document and this notebook for your project submission.